Hey, I have questions related to sasl.oauthbearer.jwks.endpoint.url, specifically regarding error handling when the specified URL cannot be queried for some reason.

Imagine a case where a Kafka broker starts with multiple listeners: one using OIDC and any number of listeners not using OIDC. Let’s consider two cases.

Case 1:
1. Kafka cluster is started and starts fine.
2. Later, the JWKS URL stops responding or responds with a malformed response. The broker fails to refresh the JWKS cache.
3. Kafka remains operational, including the OIDC listener. If there is a new ‘kid’ in the JWKS URL response, Kafka will not become aware of that ‘kid’. Thus, connections using the new kid won’t work. Connections using kids Kafka is aware of will keep on working.

Case 2:
1. Kafka cluster is started.
2. JWKS URL is not responding or responds with a malformed response.
3. Kafka broker exits because it fails to query the JWKS URL.

In both cases the situation is the same: the JWKS URL is not responding. But because of the timing of the failure (before or after broker startup), the consequences are completely different. In the latter case the consequence is fatal: the broker doesn’t start at all.

In both situations it can be argued that Kafka should run. Specifically, it shouldn’t crash in case 2. The non-OIDC listeners would work just fine. In case 1, the OIDC listener would work in practice only with the kids the broker was aware of before the JWKS URL stopped working. In case 2, the OIDC listener wouldn’t work at all, since the broker is not aware of any kids. This seems logical.

This crashing behaviour originates from KIP-768
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=186877575
which specifically states:

> If the the [sic] URL or file that is specified cannot be read, the broker will fail to start up.

The KIP gives no justification for _why_ this was decided. The documentation of “sasl.oauthbearer.jwks.endpoint.url” doesn’t mention this behaviour. It came as a surprise to us when we were using the feature.

We’re investigating whether the broker behaviour in case 2 could be changed to be like this:
1. Kafka cluster is started.
2. JWKS URL is not responding or responds with a malformed response.
3. Kafka broker starts. The OIDC listener doesn’t work; the other listeners do work.

We’ve implemented a simple patch for this for Kafka versions 3.8 - 4.1. In practice, instead of crashing, the broker just leaves the JWKS cache empty. If fetching from the JWKS URL later succeeds, the cache is populated and the OIDC listener starts working again.

We’re planning to file a KIP to propose changing this behaviour as explained above.

Any thoughts?

Regards,
Juha Mynttinen
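
To make the proposed behaviour concrete, here is a minimal sketch in Java. It is not the actual patch and not Kafka code: LenientJwksCache and JwksFetcher are hypothetical names invented for illustration, and the fetcher stands in for the HTTP call to the JWKS URL. The only point it shows is that a failed initial fetch leaves the cache empty instead of aborting, and that a later successful refresh populates it.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these names exist in Apache Kafka.
final class LenientJwksCache {

    /** Stand-in for querying the JWKS URL; returns a map of kid -> key material. */
    @FunctionalInterface
    interface JwksFetcher {
        Map<String, String> fetch() throws Exception;
    }

    private final JwksFetcher fetcher;

    // Starts empty; replaced wholesale on each successful refresh.
    private volatile Map<String, String> keysByKid = Map.of();

    LenientJwksCache(JwksFetcher fetcher) {
        this.fetcher = fetcher;
        // Initial load: a failure is logged and swallowed, so startup continues
        // with an empty cache instead of exiting.
        refresh();
    }

    /** Tries to reload the keys. On failure, the previous contents (possibly empty) are kept. */
    boolean refresh() {
        try {
            keysByKid = Map.copyOf(fetcher.fetch());
            return true;
        } catch (Exception e) {
            System.err.println("JWKS refresh failed, keeping cached keys: " + e.getMessage());
            return false;
        }
    }

    /** Returns the key for a kid, or empty if the cache does not know that kid. */
    Optional<String> keyFor(String kid) {
        return Optional.ofNullable(keysByKid.get(kid));
    }
}
```

With this shape, a token whose kid is unknown is simply rejected until a later refresh succeeds, which matches what already happens in case 1 for new kids.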
