Hey,

I have questions related to sasl.oauthbearer.jwks.endpoint.url,
specifically regarding error handling when the specified URL cannot be
queried for some reason.

Imagine a case where a Kafka broker starts with multiple listeners: one
using OIDC and any number of other listeners not using OIDC.

Let’s consider two cases.

Case 1:


   1. The Kafka cluster is started and starts fine.
   2. Later, the JWKS URL stops responding or responds with a malformed
      response. The broker fails to refresh the JWKS cache.
   3. Kafka remains operational, including the OIDC listener. If there is
      a new ‘kid’ in the JWKS URL response, Kafka will not become aware of
      that ‘kid’. Thus, connections using the new kid won’t work.
      Connections using kids Kafka is already aware of will keep working.


Case 2:


   1. Kafka cluster is started.
   2. JWKS URL is not responding or responds with a malformed response.
   3. Kafka broker exits because it fails to query the JWKS URL.


In both cases the situation is the same: the JWKS URL is not responding.
But because of the timing of the failure (before or after broker startup),
the consequences are completely different. In the latter case the
consequence is fatal: the broker doesn’t start at all.

In both situations it can be argued that Kafka should run. Specifically, it
shouldn’t crash in case 2. The non-OIDC listeners would work just fine. In
case 1, the OIDC listener would in practice work only with the kids the
broker was aware of before the JWKS URL stopped working. In case 2, the
OIDC listener wouldn’t work at all, since the broker is not aware of any
kids. This seems logical.

This crashing behaviour originates from KIP-768
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=186877575
that specifically states:

> If the the [sic] URL or file that is specified cannot be read, the broker
> will fail to start up.

The KIP gives no justification for _why_ this was decided. The
documentation of “sasl.oauthbearer.jwks.endpoint.url” doesn’t mention this
behaviour. It came as a surprise to us when we were using the feature.

We’re investigating whether the broker behaviour in case 2 could be
changed to be like this:



   1. Kafka cluster is started.
   2. JWKS URL is not responding or responds with a malformed response.
   3. Kafka broker starts. OIDC listener doesn’t work, other listeners do
      work.


We’ve implemented a simple patch for this for Kafka versions 3.8 - 4.1. In
practice, instead of crashing, the broker just leaves the JWKS cache empty.
If fetching from the JWKS URL later on succeeds, the cache is populated and
the OIDC listener starts working again.
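
As a rough illustration of the idea (this is not the patch itself, and
LenientJwksCache and its fetcher argument are hypothetical names, not Kafka
classes), the behaviour could be sketched in Java roughly like this,
assuming the JWKS fetch is modelled as a supplier of a kid-to-key map that
throws on failure:

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

/**
 * Hypothetical sketch of a JWKS cache that tolerates fetch failures.
 * Not Kafka's implementation: keys are plain strings here for brevity.
 */
final class LenientJwksCache {
    // Assumed stand-in for the HTTP call to the JWKS URL; throws on failure.
    private final Supplier<Map<String, String>> fetcher;

    // kid -> key material; starts empty and is swapped on each good fetch.
    private volatile Map<String, String> keysByKid = Map.of();

    LenientJwksCache(Supplier<Map<String, String>> fetcher) {
        this.fetcher = fetcher;
        // Initial fetch: a failure leaves the cache empty instead of
        // aborting startup.
        refresh();
    }

    /** Tries to refresh the cache; returns false and keeps old keys on failure. */
    boolean refresh() {
        try {
            keysByKid = Map.copyOf(fetcher.get());
            return true;
        } catch (RuntimeException e) {
            // Unreachable URL or malformed response: keep whatever we had.
            return false;
        }
    }

    /** Returns the key for a kid, or empty if the kid is unknown. */
    Optional<String> lookup(String kid) {
        return Optional.ofNullable(keysByKid.get(kid));
    }
}
```

With this shape, a token whose kid is not in the cache simply fails
validation, so an empty cache means the OIDC listener rejects all tokens
until a later refresh succeeds, while the other listeners are unaffected.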

We’re planning to file a KIP to propose changing this behaviour as
explained above.

Any thoughts?

Regards,
Juha Mynttinen
