[ceph-users] Re: AssumeRoleWithWebIdentity in RGW with Azure AD
Hi Ryan, This appears to be a known issue and is tracked here: https://tracker.ceph.com/issues/54562. There is a workaround mentioned in the tracker that has worked and you can try that. Otherwise, I will be working on this 'invalid padding' problem very soon. Thanks, Pritha On Tue, Jul 9, 2024 at 1:16 AM Ryan Rempel wrote: > I'm trying to setup the OIDC provider for RGW so that I can have roles > that can be assumed by people logging into their regular Azure AD > identities. The client I'm planning to use is Cyberduck – it seems like one > of the few GUI S3 clients that manages the OIDC login process in a way that > could work for relatively naive users. > > I've gotten a fair ways down the road. I've been able to configure > Cyberduck so that it performs the login with Azure AD, gets an identity > token, and then sends it to Ceph to engage with the > AssumeRoleWithWebIdentity process. However, I then get an error, which > shows up in the Ceph rgw logs like this: > > 2024-07-08T17:18:09.749+ 7fb2d7845700 0 req 15967124976712370684 > 1.284013867s sts:assume_role_web_identity Signature validation failed: evp > verify final failed: 0 error:0407008A:rsa > routines:RSA_padding_check_PKCS1_type_1:invalid padding > > I turned the logging for rgw up to 20 to see if I could follow along to > see how much of the process succeeds and learn more about what fails. I can > then see logging messages from this file in the source code: > > > https://github.com/ceph/ceph/blob/08d7ff952d78d1bbda04d5ff7e3db1e733301072/src/rgw/rgw_rest_sts.cc > > We get to WebTokenEngine::get_from_jwt, and it logs the JWT payload in a > way that seems to be as expected. The logs then indicate that a request is > sent to the /.well-known/openid-configuration endpoint that appears to be > appropriate for the issuer of the JWT. The logs eventually indicate what > looks like a successful and appropriate response to that. 
The logs then > show that a request is sent to the jwks_uri that is indicated in the > openid-configuration document. The response to that is logged, and it > appears to be appropriate. > > We then get some logging starting with "Certificate is", so it looks like > we're getting as far as WebTokenEngine::validate_signature. So, several > things appear to have happened successfully – we've loading the OIDC > provider that corresponds to the iss, and we've found a client ID that > corresponds to what I registered when I configured things. (This is why I > say we appear to be a fair ways down the road – a lot of this is working). > > It looks as though what's happening in the code now is that it's iterating > through the certificates given in the jwks_uri content. There are 6 > certificates listed, but the code only gets as far as the first one. > Looking at the code, what appears to be happening is that, among the > various certificates in the jwks_uri, it's finding the first one which > matches a thumbprint registered with Ceph (that is, which I registered with > Ceph). This must be succeeding (for the first certificate), because the > "Signature validation failed" logging comes later. So, the code does verify > that the thumbprint of the first certificate matches one of the thumbprints > I registered with Ceph for this OIDC provider. > > We then get to a part of the code where it tries to verify the JWT using > the certificate, with jwt::verify. Given what gets logged ("Signature > validateion failed: ", this must be throwing an exception. > > The thing I find surprising about this is that there really isn't any > reason to think that the first certificate listed in the jwks_uri content > is going to be the certificate used to sign the JWT. If I understand JWT > correctly, it's appropriate to sign the JWT with any of the certificates > listed in the jwks_uri content. 
Furthermore, the JWT header includes a > reference to the kid, so it's possible for Ceph to know exactly which > certificate the JWT purports to be signed by. And, Ceph knows that there > might be multiple thumbprints, because we can register 5. So, the logic of > trying the first valid certificate in x5c and then stopping if it fails > seems broken, actually. > > I suppose what I could do as a workaround is try to figure out whether > Azure AD is consistently using the same kid to sign the JWTs for me, and > then only register that thumbprint with Ceph. Then, Ceph would actually > choose the correct certificate (as the others wouldn't match a thumbprint I > registered). I may try this – in part, just to verify what I think is > happening. But it would be awfully fragile – I don't believe there is any > requirement in JWT to just use one of the certificates listed in x5c. > > An alternative would be to try rewriting the code to apply a different > kind of logic. The way it ought to work (it seems to me) is something like > this: > > > * > Get the openid_configuration, and get the jwks stuff from the jwks_uri > (which Ceph does already). > * > Look at
[ceph-users] AssumeRoleWithWebIdentity in RGW with Azure AD
I'm trying to set up the OIDC provider for RGW so that I can have roles that can be assumed by people logging into their regular Azure AD identities. The client I'm planning to use is Cyberduck – it seems like one of the few GUI S3 clients that manages the OIDC login process in a way that could work for relatively naive users. I've gotten a fair ways down the road. I've been able to configure Cyberduck so that it performs the login with Azure AD, gets an identity token, and then sends it to Ceph to engage with the AssumeRoleWithWebIdentity process. However, I then get an error, which shows up in the Ceph rgw logs like this:

2024-07-08T17:18:09.749+ 7fb2d7845700 0 req 15967124976712370684 1.284013867s sts:assume_role_web_identity Signature validation failed: evp verify final failed: 0 error:0407008A:rsa routines:RSA_padding_check_PKCS1_type_1:invalid padding

I turned the logging for rgw up to 20 to see if I could follow along to see how much of the process succeeds and learn more about what fails. I can then see logging messages from this file in the source code:

https://github.com/ceph/ceph/blob/08d7ff952d78d1bbda04d5ff7e3db1e733301072/src/rgw/rgw_rest_sts.cc

We get to WebTokenEngine::get_from_jwt, and it logs the JWT payload in a way that seems to be as expected. The logs then indicate that a request is sent to the /.well-known/openid-configuration endpoint that appears to be appropriate for the issuer of the JWT. The logs eventually indicate what looks like a successful and appropriate response to that. The logs then show that a request is sent to the jwks_uri that is indicated in the openid-configuration document. The response to that is logged, and it appears to be appropriate. We then get some logging starting with "Certificate is", so it looks like we're getting as far as WebTokenEngine::validate_signature.
So, several things appear to have happened successfully – we've loaded the OIDC provider that corresponds to the iss, and we've found a client ID that corresponds to what I registered when I configured things. (This is why I say we appear to be a fair ways down the road – a lot of this is working.)

It looks as though what's happening in the code now is that it's iterating through the certificates given in the jwks_uri content. There are 6 certificates listed, but the code only gets as far as the first one. Looking at the code, what appears to be happening is that, among the various certificates in the jwks_uri, it's finding the first one which matches a thumbprint registered with Ceph (that is, which I registered with Ceph). This must be succeeding (for the first certificate), because the "Signature validation failed" logging comes later. So, the code does verify that the thumbprint of the first certificate matches one of the thumbprints I registered with Ceph for this OIDC provider.

We then get to a part of the code where it tries to verify the JWT using the certificate, with jwt::verify. Given what gets logged ("Signature validation failed: "), this must be throwing an exception.

The thing I find surprising about this is that there really isn't any reason to think that the first certificate listed in the jwks_uri content is going to be the certificate used to sign the JWT. If I understand JWT correctly, it's appropriate to sign the JWT with any of the certificates listed in the jwks_uri content. Furthermore, the JWT header includes a reference to the kid, so it's possible for Ceph to know exactly which certificate the JWT purports to be signed by. And, Ceph knows that there might be multiple thumbprints, because we can register up to 5. So, the logic of trying the first valid certificate in x5c and then stopping if it fails seems broken, actually.
I suppose what I could do as a workaround is try to figure out whether Azure AD is consistently using the same kid to sign the JWTs for me, and then only register that thumbprint with Ceph. Then, Ceph would actually choose the correct certificate (as the others wouldn't match a thumbprint I registered). I may try this – in part, just to verify what I think is happening. But it would be awfully fragile – I don't believe there is any requirement in JWT to just use one of the certificates listed in x5c.

An alternative would be to try rewriting the code to apply a different kind of logic. The way it ought to work (it seems to me) is something like this:

* Get the openid_configuration, and get the jwks stuff from the jwks_uri (which Ceph does already).
* Look at the header of the JWT to see which kid it purports to be signed by.
* Find the certificate that corresponds to that kid (from the jwks_uri content).
* Validate the JWT with that certificate.

That ought to work, at least given what I'm seeing. (But, I'm not a JWT expert, so I don't know whether there is something unusual in how Azure AD generates JWTs and handles the jwks_uri content.) Anyway, I'm curious whether anyone else
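The kid-based lookup described in the steps above can be sketched in stdlib-only Python. The token, kid values and JWKS document below are made up for illustration; a real implementation would hand the selected certificate to a JWT library for the actual signature check:

```python
# Sketch of kid-based key selection for JWT verification (stdlib only).
import base64
import json

def b64url_decode(data: str) -> bytes:
    # JWTs use unpadded base64url; restore the padding before decoding.
    pad = "=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(data + pad)

def select_jwk(token: str, jwks: dict) -> dict:
    """Pick the JWKS entry whose kid matches the JWT header's kid."""
    header = json.loads(b64url_decode(token.split(".")[0]))
    kid = header.get("kid")
    for key in jwks["keys"]:
        if key.get("kid") == kid:
            return key
    raise KeyError(f"no JWK with kid={kid!r}")

# Hypothetical token whose header names kid "key-2" (payload/signature
# parts are placeholders, since only the header matters for selection).
header = base64.urlsafe_b64encode(
    json.dumps({"alg": "RS256", "kid": "key-2"}).encode()
).rstrip(b"=").decode()
token = header + ".payload.signature"

jwks = {"keys": [{"kid": "key-1", "x5c": ["certA"]},
                 {"kid": "key-2", "x5c": ["certB"]}]}

print(select_jwk(token, jwks)["x5c"][0])  # certB
```

Only after this selection would the verifier try the single matching certificate, instead of iterating over x5c and stopping at the first thumbprint match.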
[ceph-users] Re: Fixing BlueFS spillover (pacific 16.2.14)
Hello, I just wanted to share that the following command also helped us move slow used bytes back to the fast device (without using bluefs-bdev-expand), when several compactions couldn't:

$ cephadm shell --fsid $cid --name osd.${osd} -- ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-${osd} --devs-source /var/lib/ceph/osd/ceph-${osd}/block --dev-target /var/lib/ceph/osd/ceph-${osd}/block.db

slow_used_bytes is now back to 0 on perf dump and the BLUEFS_SPILLOVER alert got cleared, but 'bluefs stats' is not on par:

$ ceph tell osd.451 bluefs stats
1 : device size 0x1effbfe000 : using 0x30960(12 GiB)
2 : device size 0x746dfc0 : using 0x3abd77d2000(3.7 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV     WAL      DB       SLOW    *      *      REAL     FILES
LOG         0 B      22 MiB   0 B     0 B    0 B    3.9 MiB  1
WAL         0 B      33 MiB   0 B     0 B    0 B    32 MiB   2
DB          0 B      12 GiB   0 B     0 B    0 B    12 GiB   196
SLOW        0 B      4 MiB    0 B     0 B    0 B    3.8 MiB  1
TOTAL       0 B      12 GiB   0 B     0 B    0 B    0 B      200
MAXIMUMS:
LOG         0 B      22 MiB   0 B     0 B    0 B    17 MiB
WAL         0 B      33 MiB   0 B     0 B    0 B    32 MiB
DB          0 B      24 GiB   0 B     0 B    0 B    24 GiB
SLOW        0 B      4 MiB    0 B     0 B    0 B    3.8 MiB
TOTAL       0 B      24 GiB   0 B     0 B    0 B    0 B
>> SIZE <<  0 B      118 GiB  6.9 TiB

Any idea? Is this something to worry about? Regards, Frédéric.

On 16 Oct 23, at 14:46, Igor Fedotov igor.fedo...@croit.io wrote:
> Hi Chris,
>
> for the first question (osd.76) you might want to try ceph-volume's "lvm
> migrate --from data --target " command. Looks like some persistent DB
> remnants are still kept at main device causing the alert.
>
> W.r.t osd.86's question - the line "SLOW 0 B 3.0 GiB 59 GiB" means that
> RocksDB higher levels data (usually L3+) are spread over DB and main
> (aka slow) devices as 3 GB and 59 GB respectively.
>
> In other words SLOW row refers to DB data which is originally supposed
> to be at SLOW device (due to RocksDB data mapping mechanics). But
> improved bluefs logic (introduced by
> https://github.com/ceph/ceph/pull/29687) permitted extra DB disk usage
> for a part of this data.
> > Resizing DB volume and following DB compaction should do the trick and > move all the data to DB device. Alternatively ceph-volume's lvm migrate > command should do the same but the result will be rather temporary > without DB volume resizing. > > Hope this helps. > > > Thanks, > > Igor > > On 06/10/2023 06:55, Chris Dunlop wrote: >> Hi, >> >> tl;dr why are my osds still spilling? >> >> I've recently upgraded to 16.2.14 from 16.2.9 and started receiving >> bluefs spillover warnings (due to the "fix spillover alert" per the >> 16.2.14 release notes). E.g. from 'ceph health detail', the warning on >> one of these (there are a few): >> >> osd.76 spilled over 128 KiB metadata from 'db' device (56 GiB used of >> 60 GiB) to slow device >> >> This is a 15T HDD with only a 60G SSD for the db so it's not >> surprising it spilled as it's way below the recommendation for rbd >> usage at db size 1-2% of the storage size. >> >> There was some spare space on the db ssd so I increased the size of >> the db LV up over 400G and did an bluefs-bdev-expand. >> >> However, days later, I'm still getting the spillover warning for that >> osd, including after running a manual compact: >> >> # ceph tell osd.76 compact >> >> See attached perf-dump-76 for the perf dump output: >> >> # cephadm enter --name 'osd.76' ceph daemon 'osd.76' perf dump" | jq >> -r '.bluefs' >> >> In particular, if my understanding is correct, that's telling me the >> db available size is 487G (i.e. 
the LV expand worked), of which it's
>> using 59G, and there's 128K spilled to the slow device:
>>
>>   "db_total_bytes": 512309059584,  # 487G
>>   "db_used_bytes": 63470305280,    # 59G
>>   "slow_used_bytes": 131072,       # 128K
>>
>> A "bluefs stats" also says the db is using 128K of slow storage
>> (although perhaps it's getting the info from the same place as the
>> perf dump?):
>>
>> # ceph tell osd.76 bluefs stats
>> 1 : device size 0x7747ffe000 : using 0xea620(59 GiB)
>> 2 : device size 0xe8d7fc0 : using 0x6554d689000(6.3 TiB)
>> RocksDBBlueFSVolumeSelector Usage Matrix:
>> DEV/LEV  WAL    DB       SLOW   *     *     REAL     FILES
>> LOG      0 B    10 MiB   0 B    0 B   0 B   8.8 MiB  1
>> WAL      0 B    2.5 GiB  0 B    0 B   0 B
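The perf-dump counters quoted in this thread can also be checked programmatically when hunting for lingering spillover. A minimal sketch; the sample JSON mirrors the numbers quoted above rather than live cluster output:

```python
# Sketch: read BlueFS counters from a 'perf dump' JSON blob and report
# whether the OSD still has data spilled onto the slow (main) device.
import json

def spillover(perf_json: str) -> int:
    """Return slow_used_bytes from a perf-dump blob (0 = no spillover)."""
    bluefs = json.loads(perf_json)["bluefs"]
    return bluefs["slow_used_bytes"]

# Illustrative sample matching the values discussed in this thread.
sample = json.dumps({"bluefs": {
    "db_total_bytes": 512309059584,  # 487 GiB
    "db_used_bytes": 63470305280,    # 59 GiB
    "slow_used_bytes": 131072,       # 128 KiB -> still spilled
}})

print(spillover(sample))  # 131072
```

Feeding this the output of `ceph daemon osd.N perf dump` (as run via cephadm above) would show whether compaction or bluefs-bdev-migrate actually cleared the spillover.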
[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards
Hi Dhairya, Thank you ever so much for having another look at this so quickly. I don't think I have any logs similar to the ones you referenced this time as my MDSs don't seem to enter the replay stage when they crash (or at least don't now after I've thrown the logs away) but those errors do crop up in the prior logs I shared when the system first crashed. Kindest regards, Ivan On 08/07/2024 14:08, Dhairya Parmar wrote: Ugh, something went horribly wrong. I've downloaded the MDS logs that contain assertion failure and it looks relevant to this [0]. Do you have client logs for this? The other log that you shared is being downloaded right now, once that's done and I'm done going through it, I'll update you. [0] https://tracker.ceph.com/issues/54546 On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson wrote: Hi Dhairya, Sorry to resurrect this thread again, but we still unfortunately have an issue with our filesystem after we attempted to write new backups to it. We finished the scrub of the filesystem on Friday and ran a repair scrub on the 1 directory which had metadata damage. After doing so and rebooting, the cluster reported no issues and data was accessible again. We re-started the backups to run over the weekend and unfortunately the filesystem crashed again where the log of the failure is here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz. We ran the backups on kernel mounts of the filesystem without the nowsync option this time to avoid the out-of-sync write problems. I've tried resetting the journal again after recovering the dentries but unfortunately the filesystem is still in a failed state despite setting joinable to true.
The log of this crash is here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708. I'm not sure how to proceed as I can't seem to get any MDS to take over the first rank. I would like to do a scrub of the filesystem and preferably overwrite the troublesome files with the originals on the live filesystem. Do you have any advice on how to make the filesystem leave its failed state? I have a backup of the journal before I reset it so I can roll back if necessary. Here are some details about the filesystem at present:

root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
  cluster:
    id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_ERR
            1 filesystem is degraded
            1 large omap objects
            1 filesystem is offline
            1 mds daemon damaged
            nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set
            1750 pgs not deep-scrubbed in time
            1612 pgs not scrubbed in time

  services:
    mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
    mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1, pebbles-s3, pebbles-s4
    mds: 1/2 daemons up, 3 standby
    osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10 remapped pgs
         flags nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim

  data:
    volumes: 1/2 healthy, 1 recovering; 1 damaged
    pools:   7 pools, 2177 pgs
    objects: 3.24G objects, 6.7 PiB
    usage:   8.6 PiB used, 14 PiB / 23 PiB avail
    pgs:     11785954/27384310061 objects misplaced (0.043%)
             2167 active+clean
             6    active+remapped+backfilling
             4    active+remapped+backfill_wait

ceph_backup - 0 clients
===
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
        POOL            TYPE      USED   AVAIL
  mds_backup_fs         metadata  1174G  3071G
  ec82_primary_fs_data  data      0      3071G
  ec82pool              data      8085T  4738T
ceph_archive - 2 clients
RANK  STATE   MDS         ACTIVITY    DNS    INOS  DIRS  CAPS
 0    active  pebbles-s4  Reqs: 0 /s  13.4k  7105  118   2
        POOL            TYPE      USED   AVAIL
  mds_archive_fs        metadata  5184M  3071G
  ec83_primary_fs_data  data      0      3071G
  ec83pool              data      138T   4307T
STANDBY MDS
 pebbles-s2
 pebbles-s3
 pebbles-s1
MDS version: ceph version 17.2.7
[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards
Ugh, something went horribly wrong. I've downloaded the MDS logs that contain assertion failure and it looks relevant to this [0]. Do you have client logs for this? The other log that you shared is being downloaded right now, once that's done and I'm done going through it, I'll update you. [0] https://tracker.ceph.com/issues/54546 On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson wrote: > Hi Dhairya, > > Sorry to resurrect this thread again, but we still unfortunately have an > issue with our filesystem after we attempted to write new backups to it. > > We finished the scrub of the filesystem on Friday and ran a repair scrub > on the 1 directory which had metadata damage. After doing so and rebooting, > the cluster reported no issues and data was accessible again. > > We re-started the backups to run over the weekend and unfortunately the > filesystem crashed again where the log of the failure is here: > https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz. > We ran the backups on kernel mounts of the filesystem without the nowsync > option this time to avoid the out-of-sync write problems.. > > I've tried resetting the journal again after recovering the dentries but > unfortunately the filesystem is still in a failed state despite setting > joinable to true. The log of this crash is here: > https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708 > . > > I'm not sure how to proceed as I can't seem to get any MDS to take over > the first rank. I would like to do a scrub of the filesystem and preferably > overwrite the troublesome files with the originals on the live filesystem. > Do you have any advice on how to make the filesystem leave its failed > state? I have a backup of the journal before I reset it so I can roll back > if necessary. 
> > Here are some details about the filesystem at present: > > root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status > cluster: > id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631 > health: HEALTH_ERR > 1 filesystem is degraded > 1 large omap objects > 1 filesystem is offline > 1 mds daemon damaged > > nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set > 1750 pgs not deep-scrubbed in time > 1612 pgs not scrubbed in time > > services: > mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 > (age 50m) > mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1, pebbles-s3, > pebbles-s4 > mds: 1/2 daemons up, 3 standby > osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10 remapped > pgs > flags > nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim > > data: > volumes: 1/2 healthy, 1 recovering; 1 damaged > pools: 7 pools, 2177 pgs > objects: 3.24G objects, 6.7 PiB > usage: 8.6 PiB used, 14 PiB / 23 PiB avail > pgs: 11785954/27384310061 objects misplaced (0.043%) > 2167 active+clean > 6active+remapped+backfilling > 4active+remapped+backfill_wait > > ceph_backup - 0 clients > === > RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS > 0failed > POOLTYPE USED AVAIL >mds_backup_fs metadata 1174G 3071G > ec82_primary_fs_datadata 0 3071G > ec82pool data8085T 4738T > ceph_archive - 2 clients > > RANK STATE MDS ACTIVITY DNSINOS DIRS CAPS > 0active pebbles-s4 Reqs:0 /s 13.4k 7105118 2 > POOLTYPE USED AVAIL >mds_archive_fs metadata 5184M 3071G > ec83_primary_fs_datadata 0 3071G > ec83pool data 138T 4307T > STANDBY MDS > pebbles-s2 > pebbles-s3 > pebbles-s1 > MDS version: ceph version 17.2.7 > (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable) > root@pebbles-s2 11:55 [~]: ceph fs dump > e2643889 > enable_multiple, ever_enabled_multiple: 1,1 > default compat: compat={},rocompat={},incompat={1=base v0.20,2=client > writeable ranges,3=default file layouts on dirs,4=dir inode in separate > object,5=mds uses versioned encoding,6=dirfrag 
is stored in omap,8=no > anchor table,9=file layout v2,10=snaprealm v2} > legacy client fscid: 1 > > Filesystem 'ceph_backup' (1) > fs_nameceph_backup > epoch2643888 > flags12 joinable allow_snaps allow_multimds_snaps > created2023-05-19T12:52:36.302135+0100 > modified2024-07-08T11:17:55.437861+0100 > tableserver0 > root0 > session_timeout60 > session_autoclose300 > max_file_size109
[ceph-users] Slow osd ops on large arm cluster
Hello, we are having issues with slow ops on our large ARM HPC Ceph cluster. The cluster runs 18.2.0 on Ubuntu 20.04. MONs, MGRs and MDSs had to be moved to Intel servers because of poor single-core performance on our ARM servers. Our main CephFS data pool is on 54 servers in 9 racks with 1458 HDDs in total (OSDs without block.db on SSD). The CephFS data pool is configured as an erasure-coded pool with k=6, m=2 and rack-level replication. The pool has about 16k PGs, with an average of ~90 PGs per OSD. We have had good experience with EC CephFS on a 3.5 times smaller Intel Ceph cluster, but this ARM deployment is becoming problematic. We started experiencing issues when one of the users started to generate sequential RW traffic at about 5 GiB/s. A single OSD with slow ops was enough to create a laggy PG and crash the application generating this traffic. We've even had an issue where an OSD with slow ops was lagged for 6 hours and required a manual restart. Now we are experiencing slow ops even at much lower read-only traffic of ~400 MiB/s. Here is an example of slow ops on an OSD:

{
    "ops": [
        {
            "description": "osd_op(client.255949991.0:92728602 4.d22s0 4:44b3390a:::1000b640ddc.039b:head [read 3633152~8192] snapc 0=[] ondisk+read+known_if_redirected e1117246)",
            "initiated_at": "2024-07-08T10:19:58.469537+",
            "age": 507.242936848,
            "duration": 507.2429885483,
            "type_data": {
                "flag_point": "started",
                "client_info": {
                    "client": "client.255949991",
                    "client_addr": "x.x.x.x:0/887459214",
                    "tid": 92728602
                },
                "events": [
                    { "event": "initiated",     "time": "2024-07-08T10:19:58.469537+", "duration": 0 },
                    { "event": "throttled",     "time": "2024-07-08T10:19:58.469537+", "duration": 0 },
                    { "event": "header_read",   "time": "2024-07-08T10:19:58.469535+", "duration": 4294967295.981 },
                    { "event": "all_read",      "time": "2024-07-08T10:19:58.469571+", "duration": 3.5859e-05 },
                    { "event": "dispatched",    "time": "2024-07-08T10:19:58.469573+", "duration": 2.08e-06 },
                    { "event": "queued_for_pg", "time": "2024-07-08T10:19:58.469586+", "duration": 1.27210001e-05 },
                    { "event": "reached_pg",    "time": "2024-07-08T10:19:58.485132+", "duration": 0.0155460489 },
                    { "event": "started",       "time": "2024-07-08T10:19:58.485147+", "duration": 1.5161e-05 }
                ]
            }
        },

The HDD behind this OSD is not busy. The ARM cores on these servers are slow, but no process reaches full 100% core usage. I think we may have the same issue as the one described here: https://www.mail-archive.com/ceph-users@ceph.io/msg13273.html I've tried to reduce osd_pool_default_read_lease_ratio from 0.8 to 0.2, and osd_heartbeat_grace from 20 to 10. It should lower read_lease_interval from 16 to 2, but it didn't help; we still see a lot of slow ops. Could you give me tips on what I could tune to fix this issue? Could this be an issue with a large number of EC PGs on a large cluster with weak CPUs? Best regards Adam Prycki ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
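As a quick check of the tuning arithmetic in the message above (assuming, as the linked thread implies, that the read lease interval is the product of the heartbeat grace and the read lease ratio):

```python
# Sketch of the read-lease arithmetic implied above:
# read_lease_interval = osd_heartbeat_grace * osd_pool_default_read_lease_ratio
def read_lease_interval(heartbeat_grace: float, read_lease_ratio: float) -> float:
    return heartbeat_grace * read_lease_ratio

print(read_lease_interval(20, 0.8))  # defaults: 16.0 seconds
print(read_lease_interval(10, 0.2))  # tuned values: 2.0 seconds
```

So the combined change from (20, 0.8) to (10, 0.2) is what takes the interval from 16 down to 2 seconds, matching the numbers quoted.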
[ceph-users] Re: pg's stuck activating on osd create
Hi, it depends a bit on the actual OSD layout on the node and your procedure, but there's a chance you might have hit the overdose. But I would expect it to be logged in the OSD logs; two years ago in a Nautilus cluster the message looked like this:

maybe_wait_for_max_pg withhold creation of pg ...

According to github, in 16.2.15 it could look like this:

maybe_wait_for_max_pg hit max pg, dropping ...

But I'm not sure, I haven't seen that in newer clusters (yet). Regards, Eugen Zitat von Richard Bade : Hi Everyone, I had an issue last night when I was bringing online some osds that I was rebuilding. When the osds were created and came online, 15 PGs got stuck in activating. The first osd (osd.112) seemed to come online ok, but the second one (osd.113) triggered the issue. All the pgs in activating included osd.112 in the pg map and I resolved it by doing pg-upmap-items to map the pg back from osd.112 to where it currently was, but it was painful having 10min of stuck i/o on an rbd pool with VMs running. Some details about the cluster: Pacific 16.2.15, upgraded from Nautilus fairly recently and Luminous back in the past. All osds were rebuilt on bluestore in Nautilus, as were the mons. The disks in question are Intel DC P4510 8TB nvme. I'm rebuilding them as I had previously had 4x2TB osds per disk and now wanted to consolidate down to one osd per disk. There's around 300 osds in the pool with 16384 pgs, which means that the 2TB osds had 157 PGs on them. However this means that the 8TB osds have 615 PGs on them and I'm wondering if this is maybe the cause of the problem. There are no warnings about too many pgs per osd in the logs or ceph status. I have the default value of 250 for mon_max_pg_per_osd and the default value of 3.0 for osd_max_pg_per_osd_hard_ratio. My plan is to reduce the number of pgs in the pool but I want to understand and prove what happened here. Is it likely I've hit pg overdose protection?
If I have, how would I tell as I can't see anything in the cluster logs. Thanks, Rich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
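For reference, the threshold Rich is asking about can be worked out from the two settings he quotes. This assumes the commonly documented behaviour that PG creation is withheld once an OSD would exceed mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio; a sketch of the arithmetic, not the exact source logic:

```python
# Overdose-protection hard limit per OSD, from the settings quoted above.
def pg_hard_limit(mon_max_pg_per_osd: int, hard_ratio: float) -> float:
    return mon_max_pg_per_osd * hard_ratio

limit = pg_hard_limit(250, 3.0)
print(limit)        # 750.0
print(615 < limit)  # True: 615 PGs/OSD is under the hard limit at steady state
```

Note that the steady-state 615 PGs per OSD is below the 750 hard limit, but during a rebuild PGs can transiently map more heavily onto the first OSDs to come up, which is one way the limit could still have been hit.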
[ceph-users] Re: RBD Mirror - Failed to unlink peer
Hi, sorry for the delayed response, I was on vacation. I would set the "debug_rbd_mirror" config to 15 (or higher) and then watch the logs: # ceph config set client.rbd-mirror. debug_rbd_mirror 15 Maybe that reveals anything. Regards, Eugen Zitat von scott.cai...@tecnica-ltd.co.uk: Thanks - hopefully I'll hear back from devs then as I can't seem to find anything online about others encountering the same warning, but I surely can't be the only one! Would it be the rbd subsystem I'm looking to increase to debug level 15 or is there another subsystem for rbd mirroring? What would be the best way to enable it (ceph config set client debug_rbd 20 then change back to 0/5 once done)? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards
Hi Dhairya, Sorry to resurrect this thread again, but we still unfortunately have an issue with our filesystem after we attempted to write new backups to it. We finished the scrub of the filesystem on Friday and ran a repair scrub on the 1 directory which had metadata damage. After doing so and rebooting, the cluster reported no issues and data was accessible again. We re-started the backups to run over the weekend and unfortunately the filesystem crashed again where the log of the failure is here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz. We ran the backups on kernel mounts of the filesystem without the nowsync option this time to avoid the out-of-sync write problems. I've tried resetting the journal again after recovering the dentries but unfortunately the filesystem is still in a failed state despite setting joinable to true. The log of this crash is here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708. I'm not sure how to proceed as I can't seem to get any MDS to take over the first rank. I would like to do a scrub of the filesystem and preferably overwrite the troublesome files with the originals on the live filesystem. Do you have any advice on how to make the filesystem leave its failed state? I have a backup of the journal before I reset it so I can roll back if necessary.
Here are some details about the filesystem at present:

root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
  cluster:
    id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_ERR
            1 filesystem is degraded
            1 large omap objects
            1 filesystem is offline
            1 mds daemon damaged
            nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set
            1750 pgs not deep-scrubbed in time
            1612 pgs not scrubbed in time

  services:
    mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
    mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1, pebbles-s3, pebbles-s4
    mds: 1/2 daemons up, 3 standby
    osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10 remapped pgs
         flags nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim

  data:
    volumes: 1/2 healthy, 1 recovering; 1 damaged
    pools:   7 pools, 2177 pgs
    objects: 3.24G objects, 6.7 PiB
    usage:   8.6 PiB used, 14 PiB / 23 PiB avail
    pgs:     11785954/27384310061 objects misplaced (0.043%)
             2167 active+clean
             6    active+remapped+backfilling
             4    active+remapped+backfill_wait

ceph_backup - 0 clients
===
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
        POOL            TYPE      USED   AVAIL
  mds_backup_fs         metadata  1174G  3071G
  ec82_primary_fs_data  data      0      3071G
  ec82pool              data      8085T  4738T
ceph_archive - 2 clients
RANK  STATE   MDS         ACTIVITY    DNS    INOS  DIRS  CAPS
 0    active  pebbles-s4  Reqs: 0 /s  13.4k  7105  118   2
        POOL            TYPE      USED   AVAIL
  mds_archive_fs        metadata  5184M  3071G
  ec83_primary_fs_data  data      0      3071G
  ec83pool              data      138T   4307T
STANDBY MDS
 pebbles-s2
 pebbles-s3
 pebbles-s1
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)

root@pebbles-s2 11:55 [~]: ceph fs dump
e2643889
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'ceph_backup' (1)
fs_name ceph_backup
epoch 2643888
flags 12 joinable allow_snaps allow_multimds_snaps
created 2023-05-19T12:52:36.302135+0100
modified 2024-07-08T11:17:55.437861+0100
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 10993418240
required_client_features {}
last_failure 0
last_failure_osd_epoch 494515
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {}
failed
damaged 0
stopped
data_pools [6,3]
metadata_pool 2
inline_data disabled
balancer
standby_count_wanted 1

Kindest regards, Ivan On 28/06/2024 15:17, Dhairya Parmar
[ceph-users] Re: Sanity check
Hi,

your crush rule distributes each chunk on a different host, so your failure domain is host. The crush-failure-domain=osd from the EC profile is most likely left over from the initial creation — maybe it was supposed to be OSD during initial tests or whatever — but the crush rule is what counts here.

> We thought we were testing this by turning off 2 hosts, we have had one
> host offline recently and the cluster was still serving clients - did we
> get lucky?

No, you didn't get lucky. By default, an EC pool's min_size is k + 1, which is 7 in your case. You have 8 chunks in total distributed across different hosts, so turning off one host leaves 7 available chunks and the pool is still serving clients. If you shut down one more host, the pool will become inactive.

Regards,
Eugen

Zitat von Adam Witwicki:

Hello,

Can someone please let me know what failure domain my erasure code pool is, osd or host? We thought we were testing this by turning off 2 hosts, we have had one host offline recently and the cluster was still serving clients - did we get lucky?

ceph osd pool get crush_rule
crush_rule: ecpool

ceph osd pool get erasure_code_profile
erasure_code_profile: 6-2

rule ecpool {
        id 3
        type erasure
        min_size 3
        max_size 10
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step chooseleaf indep 0 type host
        step emit
}

ceph osd erasure-code-profile get 6-2
crush-device-class=hdd
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=2
plugin=jerasure
technique=reed_sol_van
w=8

octopus

Regards
Adam

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
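[Editor's note: the k + 1 arithmetic in Eugen's answer can be sketched as a toy calculation — this is plain shell arithmetic, not a ceph command, using the numbers from Adam's profile (k=6, m=2, one chunk per host).]

```shell
# Toy availability check for an EC pool with failure domain = host.
# Numbers from the profile in this thread: k=6, m=2 -> 8 chunks, 8 hosts.
k=6; m=2
chunks=$((k + m))            # 8 chunks in total, one per host
min_size=$((k + 1))          # default EC pool min_size = k + 1 = 7
for hosts_down in 0 1 2; do
  available=$((chunks - hosts_down))
  if [ "$available" -ge "$min_size" ]; then
    echo "hosts_down=${hosts_down}: ${available} chunks >= min_size ${min_size} -> pool active"
  else
    echo "hosts_down=${hosts_down}: ${available} chunks <  min_size ${min_size} -> pool inactive"
  fi
done
```

With one host down the pool sits exactly at min_size and keeps serving I/O; with two hosts down it drops below min_size and goes inactive, which matches what Adam observed.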
[ceph-users] Re: Pacific 16.2.15 `osd noin`
On 02-04-2024 15:09, Zakhar Kirpichenko wrote:

> Hi,
>
> I'm adding a few OSDs to an existing cluster, the cluster is running
> with `osd noout,noin`:
>
>   cluster:
>     id:     3f50555a-ae2a-11eb-a2fc-ffde44714d86
>     health: HEALTH_WARN
>             noout,noin flag(s) set
>
> Specifically `noin` is documented as "prevents booting OSDs from being
> marked in". But freshly added OSDs were immediately marked `up` and `in`:
>
>   services:
>     ...
>     osd: 96 osds: 96 up (since 5m), 96 in (since 6m); 338 remapped pgs
>          flags noout,noin
>
> # ceph osd tree in | grep -E "osd.11|osd.12|osd.26"
> 11  hdd  9.38680  osd.11  up  1.0  1.0
> 12  hdd  9.38680  osd.12  up  1.0  1.0
> 26  hdd  9.38680  osd.26  up  1.0  1.0
>
> Is this expected behavior? Do I misunderstand the purpose of the `noin`
> option?

We have "mon_osd_auto_mark_new_in = false" configured for this reason. With this configuration option, setting OSDs "in" becomes a manual operation. If you don't want OSDs to be marked in automatically after they have been marked out, you can use this option as well: mon_osd_auto_mark_auto_out_in = false.

Gr. Stefan
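[Editor's note: a sketch of how the two mon options Stefan mentions could be applied at runtime via `ceph config set`; the OSD id in the last command is just an example from the output above.]

```shell
# Make "in" a manual step for newly created OSDs (noin only affects
# OSDs that boot after being marked out, not brand-new ones)
ceph config set mon mon_osd_auto_mark_new_in false

# Also keep previously-out OSDs from being marked in automatically
ceph config set mon mon_osd_auto_mark_auto_out_in false

# With these set, bring an OSD in explicitly once it is ready:
ceph osd in 11
```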