[ceph-users] AssumeRoleWithWebIdentity in RGW with Azure AD

Ryan Rempel Mon, 08 Jul 2024 12:46:56 -0700

I'm trying to set up the OIDC provider for RGW so that I can have roles that can
be assumed by people logging in with their regular Azure AD identities. The client
I'm planning to use is Cyberduck – it seems like one of the few GUI S3 clients 
that manages the OIDC login process in a way that could work for relatively 
naive users.

I've gotten a fair ways down the road. I've been able to configure Cyberduck so
that it performs the login with Azure AD, gets an identity token, and then
sends it to Ceph to engage with the AssumeRoleWithWebIdentity process. However,
I then get an error, which shows up in the Ceph rgw logs like this:

2024-07-08T17:18:09.749+0000 7fb2d7845700 0 req 15967124976712370684
1.284013867s sts:assume_role_web_identity Signature validation failed: evp
verify final failed: 0 error:0407008A:rsa
routines:RSA_padding_check_PKCS1_type_1:invalid padding

I turned the logging for rgw up to 20 so that I could follow along, see how
much of the process succeeds, and learn more about what fails. I can then see
logging messages from this file in the source code:

https://github.com/ceph/ceph/blob/08d7ff952d78d1bbda04d5ff7e3db1e733301072/src/rgw/rgw_rest_sts.cc

We get to WebTokenEngine::get_from_jwt, and it logs the JWT payload in a way
that seems to be as expected. The logs then indicate that a request is sent to
the /.well-known/openid-configuration endpoint that appears to be appropriate
for the issuer of the JWT. The logs eventually indicate what looks like a
successful and appropriate response to that. The logs then show that a request
is sent to the jwks_uri that is indicated in the openid-configuration document.
The response to that is logged, and it appears to be appropriate.

We then get some logging starting with "Certificate is", so it looks like we're
getting as far as WebTokenEngine::validate_signature. So, several things appear
to have happened successfully – we've loaded the OIDC provider that
corresponds to the iss, and we've found a client ID that corresponds to what I
registered when I configured things. (This is why I say we appear to be a fair
ways down the road – a lot of this is working).

It looks as though what's happening in the code now is that it's iterating
through the certificates given in the jwks_uri content. There are 6
certificates listed, but the code only gets as far as the first one. Looking at
the code, what appears to be happening is that, among the various certificates
in the jwks_uri, it's finding the first one which matches a thumbprint
registered with Ceph (that is, which I registered with Ceph). This must be
succeeding (for the first certificate), because the "Signature validation
failed" logging comes later. So, the code does verify that the thumbprint of
the first certificate matches one of the thumbprints I registered with Ceph for
this OIDC provider.
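
To illustrate the thumbprint matching described above, here is a minimal Python sketch. It assumes the thumbprint is the hex SHA-1 digest of the DER certificate carried in each key's x5c entry, and that the comparison ignores case; the function names are hypothetical and this is not Ceph's actual code.

```python
import base64
import hashlib


def x5c_thumbprint(x5c_entry):
    """Hex SHA-1 digest of the DER certificate carried in one x5c entry."""
    # x5c entries are standard base64 (not base64url) of the DER certificate.
    der = base64.b64decode(x5c_entry)
    # Assumption: thumbprints are compared case-insensitively, so normalise.
    return hashlib.sha1(der).hexdigest().upper()


def keys_matching_registered(jwks, registered_thumbprints):
    """Return the kid of every JWKS key whose first x5c cert is registered."""
    registered = {t.upper() for t in registered_thumbprints}
    matches = []
    for key in jwks.get("keys", []):
        x5c = key.get("x5c") or []
        if x5c and x5c_thumbprint(x5c[0]) in registered:
            matches.append(key.get("kid"))
    return matches
```

If more than one kid comes back, a thumbprint match alone does not say which key signed a particular token.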

We then get to a part of the code where it tries to verify the JWT using the
certificate, with jwt::verify. Given what gets logged ("Signature validation
failed: "), this must be throwing an exception.

The thing I find surprising about this is that there really isn't any reason to
think that the first certificate listed in the jwks_uri content is going to be
the certificate used to sign the JWT. If I understand JWT correctly, it's
appropriate to sign the JWT with any of the certificates listed in the jwks_uri
content. Furthermore, the JWT header includes a reference to the kid, so it's
possible for Ceph to know exactly which certificate the JWT purports to be
signed by. And, Ceph knows that there might be multiple thumbprints, because we
can register up to 5. So, the logic of trying the first valid certificate in x5c
and then stopping if it fails seems broken, actually.
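
For anyone who wants to check which kid a given token names, the JWT header can be read without any verification. This is a small Python sketch using only the standard library; the function name is hypothetical, and the token would be whatever the client obtained from Azure AD.

```python
import base64
import json


def jwt_header(token):
    """Decode the (unverified) header segment of a compact JWT."""
    header_b64 = token.split(".")[0]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

Calling jwt_header(token).get("kid") gives the key ID the token claims to be signed with, which can then be compared against the kid values in the jwks_uri content.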

I suppose what I could do as a workaround is try to figure out whether Azure AD
is consistently using the same kid to sign the JWTs for me, and then only
register that thumbprint with Ceph. Then, Ceph would actually choose the
correct certificate (as the others wouldn't match a thumbprint I registered). I
may try this – in part, just to verify what I think is happening. But it would
be awfully fragile – I don't believe there is any requirement in JWT for the
issuer to keep using just one of the listed certificates.

An alternative would be to try rewriting the code to apply a different kind of
logic. The way it ought to work (it seems to me) is something like this:

* Get the openid_configuration, and get the jwks stuff from the jwks_uri (which
  Ceph does already).
* Look at the header of the JWT to see which kid it purports to be signed by.
* Find the certificate that corresponds to that kid (from the jwks_uri content).
* Validate the JWT with that certificate.
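
The key-selection part of the steps above could be sketched roughly as follows in Python. This is only an illustration of the proposed logic, not a patch for rgw_rest_sts.cc: the function names are hypothetical, the JWKS is assumed to be already fetched and parsed into a dict, and the final signature check is left to whatever JWT library is in use.

```python
import base64
import json


def header_kid(token):
    """Read the kid from the unverified JWT header."""
    header_b64 = token.split(".")[0]
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded)).get("kid")


def select_jwk(token, jwks):
    """Pick the JWKS entry whose kid matches the one named in the token."""
    kid = header_kid(token)
    if kid is None:
        raise KeyError("token header has no kid")
    for key in jwks.get("keys", []):
        if key.get("kid") == kid:
            return key
    raise KeyError("no key in JWKS with kid %r" % kid)
```

The selected key's certificate would still be checked against the registered thumbprints before being used to validate the signature, so only the selection changes, not the trust check.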

That ought to work, at least given what I'm seeing. (But, I'm not a JWT expert,
so I don't know whether there is something unusual in how Azure AD generates
JWTs and handles the jwks_uri content).

Anyway, I'm curious whether anyone else has been trying to get this to work
with Azure AD, and whether they have run into similar problems. And, of course,
whether I appear to be misunderstanding anything about how this is supposed to
work.

Ryan Rempel

Director of Information Technology

Canadian Mennonite University
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
