[ceph-users] Re: AssumeRoleWithWebIdentity in RGW with Azure AD

2024-07-08 Thread Pritha Srivastava
Hi Ryan,

This appears to be a known issue and is tracked here:
https://tracker.ceph.com/issues/54562. There is a workaround mentioned in
the tracker that has worked and you can try that. Otherwise, I will be
working on this 'invalid padding' problem very soon.

Thanks,
Pritha

On Tue, Jul 9, 2024 at 1:16 AM Ryan Rempel  wrote:

> I'm trying to set up the OIDC provider for RGW so that I can have roles
> that can be assumed by people logging into their regular Azure AD
> identities. The client I'm planning to use is Cyberduck – it seems like one
> of the few GUI S3 clients that manages the OIDC login process in a way that
> could work for relatively naive users.
>
> I've gotten a fair ways down the road. I've been able to configure
> Cyberduck so that it performs the login with Azure AD, gets an identity
> token, and then sends it to Ceph to engage with the
> AssumeRoleWithWebIdentity process. However, I then get an error, which
> shows up in the Ceph rgw logs like this:
>
> 2024-07-08T17:18:09.749+ 7fb2d7845700  0 req 15967124976712370684
> 1.284013867s sts:assume_role_web_identity Signature validation failed: evp
> verify final failed: 0 error:0407008A:rsa
> routines:RSA_padding_check_PKCS1_type_1:invalid padding
>
> I turned the logging for rgw up to 20 to see if I could follow along to
> see how much of the process succeeds and learn more about what fails. I can
> then see logging messages from this file in the source code:
>
>
> https://github.com/ceph/ceph/blob/08d7ff952d78d1bbda04d5ff7e3db1e733301072/src/rgw/rgw_rest_sts.cc
>
> We get to WebTokenEngine::get_from_jwt, and it logs the JWT payload in a
> way that seems to be as expected. The logs then indicate that a request is
> sent to the /.well-known/openid-configuration endpoint that appears to be
> appropriate for the issuer of the JWT. The logs eventually indicate what
> looks like a successful and appropriate response to that. The logs then
> show that a request is sent to the jwks_uri that is indicated in the
> openid-configuration document. The response to that is logged, and it
> appears to be appropriate.
>
> We then get some logging starting with "Certificate is", so it looks like
> we're getting as far as WebTokenEngine::validate_signature. So, several
> things appear to have happened successfully – we've loaded the OIDC
> provider that corresponds to the iss, and we've found a client ID that
> corresponds to what I registered when I configured things. (This is why I
> say we appear to be a fair ways down the road – a lot of this is working).
>
> It looks as though what's happening in the code now is that it's iterating
> through the certificates given in the jwks_uri content. There are 6
> certificates listed, but the code only gets as far as the first one.
> Looking at the code, what appears to be happening is that, among the
> various certificates in the jwks_uri, it's finding the first one which
> matches a thumbprint registered with Ceph (that is, which I registered with
> Ceph). This must be succeeding (for the first certificate), because the
> "Signature validation failed" logging comes later. So, the code does verify
> that the thumbprint of the first certificate matches one of the thumbprints
> I registered with Ceph for this OIDC provider.
>
> We then get to a part of the code where it tries to verify the JWT using
> the certificate, with jwt::verify. Given what gets logged ("Signature
> validation failed: "), this must be throwing an exception.
>
> The thing I find surprising about this is that there really isn't any
> reason to think that the first certificate listed in the jwks_uri content
> is going to be the certificate used to sign the JWT. If I understand JWT
> correctly, it's appropriate to sign the JWT with any of the certificates
> listed in the jwks_uri content. Furthermore, the JWT header includes a
> reference to the kid, so it's possible for Ceph to know exactly which
> certificate the JWT purports to be signed by. And, Ceph knows that there
> might be multiple thumbprints, because we can register 5. So, the logic of
> trying the first valid certificate in x5c and then stopping if it fails
> seems broken, actually.
>
> I suppose what I could do as a workaround is try to figure out whether
> Azure AD is consistently using the same kid to sign the JWTs for me, and
> then only register that thumbprint with Ceph. Then, Ceph would actually
> choose the correct certificate (as the others wouldn't match a thumbprint I
> registered). I may try this – in part, just to verify what I think is
> happening. But it would be awfully fragile – I don't believe there is any
> requirement in JWT to just use one of the certificates listed in x5c.
>
> An alternative would be to try rewriting the code to apply a different
> kind of logic. The way it ought to work (it seems to me) is something like
> this:
>
>
>   * Get the openid_configuration, and get the jwks stuff from the jwks_uri
>     (which Ceph does already).
>   * Look at 

[ceph-users] AssumeRoleWithWebIdentity in RGW with Azure AD

2024-07-08 Thread Ryan Rempel
I'm trying to set up the OIDC provider for RGW so that I can have roles that can 
be assumed by people logging into their regular Azure AD identities. The client 
I'm planning to use is Cyberduck – it seems like one of the few GUI S3 clients 
that manages the OIDC login process in a way that could work for relatively 
naive users.

I've gotten a fair ways down the road. I've been able to configure Cyberduck so 
that it performs the login with Azure AD, gets an identity token, and then 
sends it to Ceph to engage with the AssumeRoleWithWebIdentity process. However, 
I then get an error, which shows up in the Ceph rgw logs like this:

2024-07-08T17:18:09.749+ 7fb2d7845700  0 req 15967124976712370684 
1.284013867s sts:assume_role_web_identity Signature validation failed: evp 
verify final failed: 0 error:0407008A:rsa 
routines:RSA_padding_check_PKCS1_type_1:invalid padding

I turned the logging for rgw up to 20 to see if I could follow along to see how 
much of the process succeeds and learn more about what fails. I can then see 
logging messages from this file in the source code:

https://github.com/ceph/ceph/blob/08d7ff952d78d1bbda04d5ff7e3db1e733301072/src/rgw/rgw_rest_sts.cc

We get to WebTokenEngine::get_from_jwt, and it logs the JWT payload in a way 
that seems to be as expected. The logs then indicate that a request is sent to 
the /.well-known/openid-configuration endpoint that appears to be appropriate 
for the issuer of the JWT. The logs eventually indicate what looks like a 
successful and appropriate response to that. The logs then show that a request 
is sent to the jwks_uri that is indicated in the openid-configuration document. 
The response to that is logged, and it appears to be appropriate.

We then get some logging starting with "Certificate is", so it looks like we're 
getting as far as WebTokenEngine::validate_signature. So, several things appear 
to have happened successfully – we've loaded the OIDC provider that 
corresponds to the iss, and we've found a client ID that corresponds to what I 
registered when I configured things. (This is why I say we appear to be a fair 
ways down the road – a lot of this is working).

It looks as though what's happening in the code now is that it's iterating 
through the certificates given in the jwks_uri content. There are 6 
certificates listed, but the code only gets as far as the first one. Looking at 
the code, what appears to be happening is that, among the various certificates 
in the jwks_uri, it's finding the first one which matches a thumbprint 
registered with Ceph (that is, which I registered with Ceph). This must be 
succeeding (for the first certificate), because the "Signature validation 
failed" logging comes later. So, the code does verify that the thumbprint of 
the first certificate matches one of the thumbprints I registered with Ceph for 
this OIDC provider.

We then get to a part of the code where it tries to verify the JWT using the 
certificate, with jwt::verify. Given what gets logged ("Signature validation 
failed: "), this must be throwing an exception.

The thing I find surprising about this is that there really isn't any reason to 
think that the first certificate listed in the jwks_uri content is going to be 
the certificate used to sign the JWT. If I understand JWT correctly, it's 
appropriate to sign the JWT with any of the certificates listed in the jwks_uri 
content. Furthermore, the JWT header includes a reference to the kid, so it's 
possible for Ceph to know exactly which certificate the JWT purports to be 
signed by. And, Ceph knows that there might be multiple thumbprints, because we 
can register 5. So, the logic of trying the first valid certificate in x5c and 
then stopping if it fails seems broken, actually.

I suppose what I could do as a workaround is try to figure out whether Azure AD 
is consistently using the same kid to sign the JWTs for me, and then only 
register that thumbprint with Ceph. Then, Ceph would actually choose the 
correct certificate (as the others wouldn't match a thumbprint I registered). I 
may try this – in part, just to verify what I think is happening. But it would 
be awfully fragile – I don't believe there is any requirement in JWT to just 
use one of the certificates listed in x5c.
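
As a rough sketch of that check (assuming jq and openssl are available, that
JWKS_URI holds the jwks_uri from the openid-configuration response, and that
KID is the kid decoded from the first (header) segment of the JWT), something
like this should show which thumbprint that key corresponds to:

# pull the signing certificate (x5c is base64-encoded DER) for the kid in use
$ curl -s "$JWKS_URI" \
    | jq -r --arg kid "$KID" '.keys[] | select(.kid == $kid) | .x5c[0]' \
    > signing-cert.b64
# compute the SHA-1 fingerprint and strip the colons to match the form used
# when registering the provider's thumbprints
$ base64 -d signing-cert.b64 \
    | openssl x509 -inform DER -noout -fingerprint -sha1 \
    | cut -d= -f2 | tr -d ':'

Registering only that thumbprint would at least confirm whether the
first-matching-certificate logic is what's failing.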

An alternative would be to try rewriting the code to apply a different kind of 
logic. The way it ought to work (it seems to me) is something like this:


  * Get the openid_configuration, and get the jwks stuff from the jwks_uri (which 
    Ceph does already).
  * Look at the header of the JWT to see which kid it purports to be signed by.
  * Find the certificate that corresponds to that kid (from the jwks_uri content).
  * Validate the JWT with that certificate.

That ought to work, at least given what I'm seeing. (But I'm not a JWT expert, 
so I don't know whether there is something unusual in how Azure AD generates 
JWTs and handles the jwks_uri content.)

Anyway, I'm curious whether anyone else 

[ceph-users] Re: Fixing BlueFS spillover (pacific 16.2.14)

2024-07-08 Thread Frédéric Nass
Hello,

I just wanted to share that the following command also helped us move slow used 
bytes back to the fast device (without using bluefs-bdev-expand), when several 
compactions couldn't:

$ cephadm shell --fsid $cid --name osd.${osd} -- ceph-bluestore-tool 
bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-${osd} --devs-source 
/var/lib/ceph/osd/ceph-${osd}/block --dev-target 
/var/lib/ceph/osd/ceph-${osd}/block.db
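
As a sketch of the perf dump / bluefs stats checks referred to below (OSD id to
be adjusted; jq assumed available):

$ cephadm enter --name osd.${osd} ceph daemon osd.${osd} perf dump \
    | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'
$ ceph tell osd.${osd} bluefs stats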

slow_used_bytes is now back to 0 on perf dump and BLUEFS_SPILLOVER alert got 
cleared but 'bluefs stats' is not on par:

$ ceph tell osd.451 bluefs stats
1 : device size 0x1effbfe000 : using 0x30960(12 GiB)
2 : device size 0x746dfc0 : using 0x3abd77d2000(3.7 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV     WAL     DB       SLOW    *       *       REAL     FILES
LOG         0 B     22 MiB   0 B     0 B     0 B     3.9 MiB  1
WAL         0 B     33 MiB   0 B     0 B     0 B     32 MiB   2
DB          0 B     12 GiB   0 B     0 B     0 B     12 GiB   196
SLOW        0 B     4 MiB    0 B     0 B     0 B     3.8 MiB  1
TOTAL       0 B     12 GiB   0 B     0 B     0 B     0 B      200
MAXIMUMS:
LOG         0 B     22 MiB   0 B     0 B     0 B     17 MiB
WAL         0 B     33 MiB   0 B     0 B     0 B     32 MiB
DB          0 B     24 GiB   0 B     0 B     0 B     24 GiB
SLOW        0 B     4 MiB    0 B     0 B     0 B     3.8 MiB
TOTAL       0 B     24 GiB   0 B     0 B     0 B     0 B
>> SIZE <<  0 B     118 GiB  6.9 TiB

Any idea? Is this something to worry about?

Regards,
Frédéric.

- On 16 Oct 23, at 14:46, Igor Fedotov igor.fedo...@croit.io wrote:

> Hi Chris,
> 
> for the first question (osd.76) you might want to try ceph-volume's "lvm
> migrate --from data --target " command. Looks like some
> persistent DB remnants are still kept at main device causing the alert.
> 
> W.r.t osd.86's question - the line "SLOW    0 B 3.0 GiB
> 59 GiB" means that RocksDB higher levels  data (usually L3+) are spread
> over DB and main (aka slow) devices as 3 GB and 59 GB respectively.
> 
> In other words, the SLOW row refers to DB data which is originally supposed
> to be at the SLOW device (due to RocksDB data mapping mechanics). But
> improved bluefs logic (introduced by
> https://github.com/ceph/ceph/pull/29687) permitted extra DB disk usage
> for a part of this data.
> 
> Resizing the DB volume and a subsequent DB compaction should do the trick and
> move all the data to the DB device. Alternatively, ceph-volume's lvm migrate
> command should do the same, but the result will be rather temporary
> without DB volume resizing.
> 
> Hope this helps.
> 
> 
> Thanks,
> 
> Igor
> 
> On 06/10/2023 06:55, Chris Dunlop wrote:
>> Hi,
>>
>> tl;dr why are my osds still spilling?
>>
>> I've recently upgraded to 16.2.14 from 16.2.9 and started receiving
>> bluefs spillover warnings (due to the "fix spillover alert" per the
>> 16.2.14 release notes). E.g. from 'ceph health detail', the warning on
>> one of these (there are a few):
>>
>> osd.76 spilled over 128 KiB metadata from 'db' device (56 GiB used of
>> 60 GiB) to slow device
>>
>> This is a 15T HDD with only a 60G SSD for the db so it's not
>> surprising it spilled as it's way below the recommendation for rbd
>> usage at db size 1-2% of the storage size.
>>
>> There was some spare space on the db ssd so I increased the size of
>> the db LV up over 400G and did an bluefs-bdev-expand.
>>
>> However, days later, I'm still getting the spillover warning for that
>> osd, including after running a manual compact:
>>
>> # ceph tell osd.76 compact
>>
>> See attached perf-dump-76 for the perf dump output:
>>
>> # cephadm enter --name 'osd.76' ceph daemon 'osd.76' perf dump | jq
>> -r '.bluefs'
>>
>> In particular, if my understanding is correct, that's telling me the
>> db available size is 487G (i.e. the LV expand worked), of which it's
>> using 59G, and there's 128K spilled to the slow device:
>>
>> "db_total_bytes": 512309059584,  # 487G
>> "db_used_bytes": 63470305280,    # 59G
>> "slow_used_bytes": 131072,   # 128K
>>
>> A "bluefs stats" also says the db is using 128K of slow storage
>> (although perhaps it's getting the info from the same place as the
>> perf dump?):
>>
>> # ceph tell osd.76 bluefs stats
>> 1 : device size 0x7747ffe000 : using 0xea620(59 GiB)
>> 2 : device size 0xe8d7fc0 : using 0x6554d689000(6.3 TiB)
>> RocksDBBlueFSVolumeSelector Usage Matrix:
>> DEV/LEV     WAL     DB       SLOW    *       *       REAL     FILES
>> LOG         0 B     10 MiB   0 B     0 B     0 B     8.8 MiB  1
>> WAL         0 B     2.5 GiB  0 B     0 B     0 B

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-08 Thread Ivan Clayson

Hi Dhairya,

Thank you ever so much for having another look at this so quickly. I 
don't think I have any logs similar to the ones you referenced this time 
as my MDSs don't seem to enter the replay stage when they crash (or at 
least don't now after I've thrown the logs away) but those errors do 
crop up in the prior logs I shared when the system first crashed.


Kindest regards,

Ivan

On 08/07/2024 14:08, Dhairya Parmar wrote:

Ugh, something went horribly wrong. I've downloaded the MDS logs that 
contain assertion failure and it looks relevant to this [0]. Do you 
have client logs for this?


The other log that you shared is being downloaded right now, once 
that's done and I'm done going through it, I'll update you.


[0] https://tracker.ceph.com/issues/54546

On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson  
wrote:


Hi Dhairya,

Sorry to resurrect this thread again, but we still unfortunately
have an issue with our filesystem after we attempted to write new
backups to it.

We finished the scrub of the filesystem on Friday and ran a repair
scrub on the 1 directory which had metadata damage. After doing so
and rebooting, the cluster reported no issues and data was
accessible again.

We re-started the backups to run over the weekend and
unfortunately the filesystem crashed again where the log of the
failure is here:

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz.
We ran the backups on kernel mounts of the filesystem without the
nowsync option this time to avoid the out-of-sync write problems..

I've tried resetting the journal again after recovering the
dentries but unfortunately the filesystem is still in a failed
state despite setting joinable to true. The log of this crash is
here:

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708.

I'm not sure how to proceed as I can't seem to get any MDS to take
over the first rank. I would like to do a scrub of the filesystem
and preferably overwrite the troublesome files with the originals
on the live filesystem. Do you have any advice on how to make the
filesystem leave its failed state? I have a backup of the journal
before I reset it so I can roll back if necessary.

Here are some details about the filesystem at present:

root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
  cluster:
    id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_ERR
    1 filesystem is degraded
    1 large omap objects
    1 filesystem is offline
    1 mds daemon damaged
nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
flag(s) set
    1750 pgs not deep-scrubbed in time
    1612 pgs not scrubbed in time

  services:
    mon: 4 daemons, quorum
pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
    mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1,
pebbles-s3, pebbles-s4
    mds: 1/2 daemons up, 3 standby
    osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d);
10 remapped pgs
 flags
nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim

  data:
    volumes: 1/2 healthy, 1 recovering; 1 damaged
    pools:   7 pools, 2177 pgs
    objects: 3.24G objects, 6.7 PiB
    usage:   8.6 PiB used, 14 PiB / 23 PiB avail
    pgs: 11785954/27384310061 objects misplaced (0.043%)
 2167 active+clean
 6    active+remapped+backfilling
 4    active+remapped+backfill_wait

ceph_backup - 0 clients
===
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
    POOL    TYPE USED  AVAIL
   mds_backup_fs  metadata  1174G  3071G
ec82_primary_fs_data    data   0   3071G
  ec82pool  data    8085T  4738T
ceph_archive - 2 clients

RANK  STATE  MDS ACTIVITY DNS    INOS DIRS   CAPS
 0    active  pebbles-s4  Reqs:    0 /s  13.4k  7105 118  2
    POOL    TYPE USED  AVAIL
   mds_archive_fs metadata  5184M  3071G
ec83_primary_fs_data    data   0   3071G
  ec83pool  data 138T  4307T
STANDBY MDS
 pebbles-s2
 pebbles-s3
 pebbles-s1
MDS version: ceph version 17.2.7

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-08 Thread Dhairya Parmar
Ugh, something went horribly wrong. I've downloaded the MDS logs that
contain assertion failure and it looks relevant to this [0]. Do you have
client logs for this?

The other log that you shared is being downloaded right now, once that's
done and I'm done going through it, I'll update you.

[0] https://tracker.ceph.com/issues/54546

On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson  wrote:

> Hi Dhairya,
>
> Sorry to resurrect this thread again, but we still unfortunately have an
> issue with our filesystem after we attempted to write new backups to it.
>
> We finished the scrub of the filesystem on Friday and ran a repair scrub
> on the 1 directory which had metadata damage. After doing so and rebooting,
> the cluster reported no issues and data was accessible again.
>
> We re-started the backups to run over the weekend and unfortunately the
> filesystem crashed again where the log of the failure is here:
> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz.
> We ran the backups on kernel mounts of the filesystem without the nowsync
> option this time to avoid the out-of-sync write problems..
>
> I've tried resetting the journal again after recovering the dentries but
> unfortunately the filesystem is still in a failed state despite setting
> joinable to true. The log of this crash is here:
> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708
> .
>
> I'm not sure how to proceed as I can't seem to get any MDS to take over
> the first rank. I would like to do a scrub of the filesystem and preferably
> overwrite the troublesome files with the originals on the live filesystem.
> Do you have any advice on how to make the filesystem leave its failed
> state? I have a backup of the journal before I reset it so I can roll back
> if necessary.
>
> Here are some details about the filesystem at present:
>
> root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
>   cluster:
> id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
> health: HEALTH_ERR
> 1 filesystem is degraded
> 1 large omap objects
> 1 filesystem is offline
> 1 mds daemon damaged
>
> nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set
> 1750 pgs not deep-scrubbed in time
> 1612 pgs not scrubbed in time
>
>   services:
> mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4
> (age 50m)
> mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1, pebbles-s3,
> pebbles-s4
> mds: 1/2 daemons up, 3 standby
> osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10 remapped
> pgs
>  flags
> nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
>
>   data:
> volumes: 1/2 healthy, 1 recovering; 1 damaged
> pools:   7 pools, 2177 pgs
> objects: 3.24G objects, 6.7 PiB
> usage:   8.6 PiB used, 14 PiB / 23 PiB avail
> pgs: 11785954/27384310061 objects misplaced (0.043%)
>  2167 active+clean
>  6active+remapped+backfilling
>  4active+remapped+backfill_wait
>
> ceph_backup - 0 clients
> ===
> RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
>  0failed
> POOLTYPE USED  AVAIL
>mds_backup_fs  metadata  1174G  3071G
> ec82_primary_fs_datadata   0   3071G
>   ec82pool  data8085T  4738T
> ceph_archive - 2 clients
> 
> RANK  STATE  MDS ACTIVITY DNSINOS   DIRS   CAPS
>  0active  pebbles-s4  Reqs:0 /s  13.4k  7105118  2
> POOLTYPE USED  AVAIL
>mds_archive_fs metadata  5184M  3071G
> ec83_primary_fs_datadata   0   3071G
>   ec83pool  data 138T  4307T
> STANDBY MDS
>  pebbles-s2
>  pebbles-s3
>  pebbles-s1
> MDS version: ceph version 17.2.7
> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
> root@pebbles-s2 11:55 [~]: ceph fs dump
> e2643889
> enable_multiple, ever_enabled_multiple: 1,1
> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
> writeable ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> anchor table,9=file layout v2,10=snaprealm v2}
> legacy client fscid: 1
>
> Filesystem 'ceph_backup' (1)
> fs_name    ceph_backup
> epoch    2643888
> flags    12 joinable allow_snaps allow_multimds_snaps
> created    2023-05-19T12:52:36.302135+0100
> modified    2024-07-08T11:17:55.437861+0100
> tableserver    0
> root    0
> session_timeout    60
> session_autoclose    300
> max_file_size    10993418240

[ceph-users] Slow osd ops on large arm cluster

2024-07-08 Thread Adam Prycki

Hello,

we are having issues with slow ops on our large ARM HPC Ceph cluster.

The cluster runs on 18.2.0 and Ubuntu 20.04.
MONs, MGRs and MDSs had to be moved to Intel servers because of poor 
single-core performance on our ARM servers.
Our main CephFS data pool is on 54 servers in 9 racks with 1458 HDDs in 
total (OSDs without block.db on SSD).
The CephFS data pool is configured as an erasure-coded pool with k=6, m=2 and 
rack-level replication. The pool has about 16k PGs, with an average of ~90 PGs 
per OSD.


We have had a good experience with EC CephFS on an Intel Ceph cluster 3.5 
times smaller, but this ARM deployment is becoming problematic. We started 
experiencing issues when one of the users started to generate sequential RW 
traffic at about 5 GiB/s. A single OSD with slow ops was enough to create a 
laggy PG and crash the application generating this traffic.
We've even had an issue where an OSD with slow ops was lagged for 6 hours and 
required a manual restart.


Now we are experiencing slow ops even at much lower read-only traffic of 
~400 MiB/s.


Here is an example of a slow op on one OSD:
{
"ops": [
{
"description": "osd_op(client.255949991.0:92728602 4.d22s0 
4:44b3390a:::1000b640ddc.039b:head [read 3633152~8192] snapc 0=[] 
ondisk+read+known_if_redirected e1117246)",

"initiated_at": "2024-07-08T10:19:58.469537+",
"age": 507.242936848,
"duration": 507.2429885483,
"type_data": {
"flag_point": "started",
"client_info": {
"client": "client.255949991",
"client_addr": "x.x.x.x:0/887459214",
"tid": 92728602
},
"events": [
{
"event": "initiated",
"time": "2024-07-08T10:19:58.469537+",
"duration": 0
},
{
"event": "throttled",
"time": "2024-07-08T10:19:58.469537+",
"duration": 0
},
{
"event": "header_read",
"time": "2024-07-08T10:19:58.469535+",
"duration": 4294967295.981
},
{
"event": "all_read",
"time": "2024-07-08T10:19:58.469571+",
"duration": 3.5859e-05
},
{
"event": "dispatched",
"time": "2024-07-08T10:19:58.469573+",
"duration": 2.08e-06
},
{
"event": "queued_for_pg",
"time": "2024-07-08T10:19:58.469586+",
"duration": 1.27210001e-05
},
{
"event": "reached_pg",
"time": "2024-07-08T10:19:58.485132+",
"duration": 0.0155460489
},
{
"event": "started",
"time": "2024-07-08T10:19:58.485147+",
"duration": 1.5161e-05
}
]
}
},
The HDD backing this OSD is not busy. The ARM cores on these servers are slow, 
but no process reaches 100% core usage.


I think we may have the same issue as one described here: 
https://www.mail-archive.com/ceph-users@ceph.io/msg13273.html


I've tried to reduce osd_pool_default_read_lease_ratio from 0.8 to 0.2 and 
osd_heartbeat_grace from 20 to 10. That should lower read_lease_interval from 
16 to 2, but it didn't help; we still see a lot of slow ops.
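
For reference, read_lease_interval is derived as osd_heartbeat_grace *
osd_pool_default_read_lease_ratio, which is where the 16 -> 2 above comes from
(20 * 0.8 = 16, 10 * 0.2 = 2). A sketch of how the change was applied, assuming
it was done cluster-wide via ceph config:

  ceph config set global osd_pool_default_read_lease_ratio 0.2   # default 0.8
  ceph config set global osd_heartbeat_grace 10                  # default 20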


Could you give me tips on what I could tune to fix this issue?

Could this be an issue with a large number of EC PGs on a large cluster with 
weak CPUs?


Best regards
Adam Prycki
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg's stuck activating on osd create

2024-07-08 Thread Eugen Block

Hi,

it depends a bit on the actual OSD layout on the node and your  
procedure, but there's a chance you might have hit the overdose limit. But I  
would expect it to be logged in the OSD logs; two years ago in a  
Nautilus cluster the message looked like this:



maybe_wait_for_max_pg withhold creation of pg ...


According to github in 16.2.15 it could look like this:


maybe_wait_for_max_pg hit max pg, dropping ...


But I'm not sure, I haven't seen that in newer clusters (yet).

Regards,
Eugen

Zitat von Richard Bade :


Hi Everyone,
I had an issue last night when I was bringing online some OSDs that I
was rebuilding. When the OSDs were created and came online, 15 PGs got stuck
in activating. The first OSD (osd.112) seemed to come online ok, but
the second one (osd.113) triggered the issue. All the PGs stuck in
activating included osd.112 in the PG map, and I resolved it by doing
pg-upmap-items to map each PG back from osd.112 to where it currently
was, but it was painful having 10 minutes of stuck I/O on an RBD pool with
VMs running.
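
For reference, the manual mapping is of the form below; the pg id and target
OSD here are placeholders:

  ceph osd pg-upmap-items <pgid> <from-osd-id> <to-osd-id>
  ceph osd pg-upmap-items 5.1a 112 87   # e.g. map pg 5.1a off osd.112 back to osd.87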

Some details about the cluster:
Pacific 16.2.15, upgraded from Nautilus fairly recently and Luminous
back in the past. All OSDs were rebuilt on BlueStore in Nautilus, as
were the mons.
The disks in question are Intel DC P4510 8TB NVMe. I'm rebuilding them
as I previously had 4x 2TB OSDs per disk and now want to
consolidate down to one OSD per disk.
There are around 300 OSDs in the pool with 16384 PGs, which means that
the 2TB OSDs had 157 PGs on them. However, this means that the 8TB OSDs
have 615 PGs on them, and I'm wondering if this is maybe the cause of
the problem.

There are no warnings about too many PGs per OSD in the logs or ceph status.
I have the default value of 250 for mon_max_pg_per_osd and the default
value of 3.0 for osd_max_pg_per_osd_hard_ratio.
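
If I understand the two options correctly, the hard limit at which an OSD
withholds PG creation should work out to 250 * 3.0 = 750 PGs per OSD. A sketch
of how to check the effective values and the per-OSD PG counts:

  ceph config get osd mon_max_pg_per_osd
  ceph config get osd osd_max_pg_per_osd_hard_ratio
  ceph osd df tree    # the PGS column shows how many PGs each OSD currently holds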

My plan is to reduce the number of PGs in the pool, but I want to
understand and prove what happened here.
Is it likely I've hit PG overdose protection? If I have, how would I
tell, as I can't see anything in the cluster logs?

Thanks,
Rich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD Mirror - Failed to unlink peer

2024-07-08 Thread Eugen Block

Hi,

sorry for the delayed response, I was on vacation.

I would set the "debug_rbd_mirror" config to 15 (or higher) and then  
watch the logs:


# ceph config set client.rbd-mirror. debug_rbd_mirror 15

Maybe that reveals something.

Regards,
Eugen

Zitat von scott.cai...@tecnica-ltd.co.uk:

Thanks - hopefully I'll hear back from devs then as I can't seem to  
find anything online about others encountering the same warning, but  
I surely can't be the only one!


Would it be the rbd subsystem I'm looking to increase to debug level  
15 or is there another subsystem for rbd mirroring?
What would be the best way to enable it (ceph config set client  
debug_rbd 20 then change back to 0/5 once done)?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-08 Thread Ivan Clayson

Hi Dhairya,

Sorry to resurrect this thread again, but we still unfortunately have an 
issue with our filesystem after we attempted to write new backups to it.


We finished the scrub of the filesystem on Friday and ran a repair scrub 
on the 1 directory which had metadata damage. After doing so and 
rebooting, the cluster reported no issues and data was accessible again.


We re-started the backups to run over the weekend and unfortunately the 
filesystem crashed again where the log of the failure is here: 
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz. 
We ran the backups on kernel mounts of the filesystem without the 
nowsync option this time to avoid the out-of-sync write problems..


I've tried resetting the journal again after recovering the dentries but 
unfortunately the filesystem is still in a failed state despite setting 
joinable to true. The log of this crash is here: 
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708.


I'm not sure how to proceed as I can't seem to get any MDS to take over 
the first rank. I would like to do a scrub of the filesystem and 
preferably overwrite the troublesome files with the originals on the 
live filesystem. Do you have any advice on how to make the filesystem 
leave its failed state? I have a backup of the journal before I reset it 
so I can roll back if necessary.


Here are some details about the filesystem at present:

   root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
  cluster:
    id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_ERR
    1 filesystem is degraded
    1 large omap objects
    1 filesystem is offline
    1 mds daemon damaged
   nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
   flag(s) set
    1750 pgs not deep-scrubbed in time
    1612 pgs not scrubbed in time

  services:
    mon: 4 daemons, quorum
   pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
    mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1,
   pebbles-s3, pebbles-s4
    mds: 1/2 daemons up, 3 standby
    osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10
   remapped pgs
 flags
   nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim

  data:
    volumes: 1/2 healthy, 1 recovering; 1 damaged
    pools:   7 pools, 2177 pgs
    objects: 3.24G objects, 6.7 PiB
    usage:   8.6 PiB used, 14 PiB / 23 PiB avail
    pgs: 11785954/27384310061 objects misplaced (0.043%)
 2167 active+clean
 6    active+remapped+backfilling
 4    active+remapped+backfill_wait

   ceph_backup - 0 clients
   ===
   RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
    POOL    TYPE USED  AVAIL
   mds_backup_fs  metadata  1174G  3071G
   ec82_primary_fs_data    data   0   3071G
  ec82pool  data    8085T  4738T
   ceph_archive - 2 clients
   
   RANK  STATE  MDS ACTIVITY DNS    INOS   DIRS CAPS
 0    active  pebbles-s4  Reqs:    0 /s  13.4k  7105    118 2
    POOL    TYPE USED  AVAIL
   mds_archive_fs metadata  5184M  3071G
   ec83_primary_fs_data    data   0   3071G
  ec83pool  data 138T  4307T
   STANDBY MDS
 pebbles-s2
 pebbles-s3
 pebbles-s1
   MDS version: ceph version 17.2.7
   (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
   root@pebbles-s2 11:55 [~]: ceph fs dump
   e2643889
   enable_multiple, ever_enabled_multiple: 1,1
   default compat: compat={},rocompat={},incompat={1=base
   v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir
   inode in separate object,5=mds uses versioned encoding,6=dirfrag is
   stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
   legacy client fscid: 1

   Filesystem 'ceph_backup' (1)
   fs_name    ceph_backup
   epoch    2643888
   flags    12 joinable allow_snaps allow_multimds_snaps
   created    2023-05-19T12:52:36.302135+0100
   modified    2024-07-08T11:17:55.437861+0100
   tableserver    0
   root    0
   session_timeout    60
   session_autoclose    300
   max_file_size    10993418240
   required_client_features    {}
   last_failure    0
   last_failure_osd_epoch    494515
   compat    compat={},rocompat={},incompat={1=base v0.20,2=client
   writeable ranges,3=default file layouts on dirs,4=dir inode in
   separate object,5=mds uses versioned encoding,6=dirfrag is stored in
   omap,7=mds uses inline data,8=no anchor table,9=file layout
   v2,10=snaprealm v2}
   max_mds    1
   in    0
   up    {}
   failed
   damaged    0
   stopped
   data_pools    [6,3]
   metadata_pool    2
   inline_data    disabled
   balancer
   standby_count_wanted    1


Kindest regards,

Ivan

On 28/06/2024 15:17, Dhairya Parmar

[ceph-users] Re: Sanity check

2024-07-08 Thread Eugen Block

Hi,

your crush rule distributes each chunk to a different host, so your  
failure domain is host. The crush-failure-domain=osd in the EC  
profile is most likely left over from the initial creation (maybe it was  
meant to be OSD during initial tests), but the crush  
rule is what matters here.


We thought we were testing this by turning off 2 hosts; we have had one  
host offline recently and the cluster was still serving clients -  
did we get lucky?


No, you didn't get lucky. By default, an EC pool's min_size is k + 1,  
which is 7 in your case. You have 8 chunks in total distributed across  
different hosts; turning off one host leaves 7 available chunks,  
so the pool is still serving clients. If you shut down one more host,  
the pool will become inactive.
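
A quick way to confirm both points (substitute your pool name):

  ceph osd pool get <pool> min_size
  ceph osd crush rule dump ecpool

The first should report 7 (k + 1) unless it was changed; in the second, the
chooseleaf step's "type": "host" is what makes host the failure domain.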


Regards,
Eugen

Zitat von Adam Witwicki :


Hello,

Can someone please let me know which failure domain my erasure-coded  
pool uses, osd or host?
We thought we were testing this by turning off 2 hosts; we have had one  
host offline recently and the cluster was still serving clients -  
did we get lucky?


ceph osd pool get  crush_rule
crush_rule: ecpool

ceph osd pool get  erasure_code_profile
erasure_code_profile: 6-2

rule ecpool {
id 3
type erasure
min_size 3
max_size 10
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default
step chooseleaf indep 0 type host
step emit
}


ceph osd erasure-code-profile get 6-2
crush-device-class=hdd
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=2
plugin=jerasure
technique=reed_sol_van
w=8



octopus

Regards


Adam


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Pacific 16.2.15 `osd noin`

2024-07-08 Thread Stefan Kooman

On 02-04-2024 15:09, Zakhar Kirpichenko wrote:

Hi,

I'm adding a few OSDs to an existing cluster; the cluster is running with
`osd noout,noin`:

   cluster:
 id: 3f50555a-ae2a-11eb-a2fc-ffde44714d86
 health: HEALTH_WARN
 noout,noin flag(s) set

Specifically `noin` is documented as "prevents booting OSDs from being
marked in". But freshly added OSDs were immediately marked `up` and `in`:

   services:
 ...
 osd: 96 osds: 96 up (since 5m), 96 in (since 6m); 338 remapped pgs
  flags noout,noin

# ceph osd tree in | grep -E "osd.11|osd.12|osd.26"
  11hdd9.38680  osd.11   up   1.0  1.0
  12hdd9.38680  osd.12   up   1.0  1.0
  26hdd9.38680  osd.26   up   1.0  1.0

Is this expected behavior? Do I misunderstand the purpose of the `noin`
option?


We have "mon_osd_auto_mark_new_in = false" configured for this reason. 
With this configuration option, setting OSDs "IN" becomes a manual operation.


If you don't want to have OSDs marked in automatically after they have 
been marked out, you can use this option as well:


mon_osd_auto_mark_auto_out_in = false.
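
A sketch of setting both centrally (option names as above; adjust to how the
cluster is managed):

  ceph config set mon mon_osd_auto_mark_new_in false
  ceph config set mon mon_osd_auto_mark_auto_out_in false

New OSDs then stay "out" until marked in explicitly, e.g. with "ceph osd in <id>".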

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io