andrewhharmon opened a new issue, #61737:
URL: https://github.com/apache/airflow/issues/61737
### Apache Airflow Provider(s)
cncf-kubernetes
### Versions of Apache Airflow Providers
apache-airflow-providers-cncf-kubernetes==10.12.3 (regression introduced in
10.12.0)
Working in: apache-airflow-providers-cncf-kubernetes==10.11.0
### Apache Airflow version
3.0.0 (also affects 2.x with the affected provider version)
### Operating System
Debian/Ubuntu-based containers (Astronomer Runtime)
### Deployment
Astronomer
### Deployment details
Triggerer runs on a separate host from the worker. EKS cluster
authentication uses exec-based kubeconfig (`aws eks get-token`), where the exec
command must be re-invoked periodically to obtain fresh short-lived tokens.
### What happened
`KubernetesPodTrigger` fails with 401 Unauthorized after ~15 minutes when
using exec-based kubeconfig authentication (e.g., EKS clusters with `aws eks
get-token`).
**Root cause:** In version 10.12.0, a `_config_loaded` caching guard was
added to `AsyncKubernetesHook._load_config()`:
```python
async def _load_config(self):
    """Load Kubernetes configuration once per hook instance."""
    if self._config_loaded:  # <-- new in 10.12.x
        return
    # ... load config, execute exec plugin, get token ...
    self._config_loaded = True
```
In previous versions (10.11.x and earlier), `_load_config()` ran on every
`get_conn()` call. This meant the exec plugin (e.g., `aws eks get-token`) was
re-invoked on each poll, always producing a fresh token.
With the `_config_loaded` guard, the exec plugin runs **once** for the
lifetime of the hook instance. Since `KubernetesPodTrigger.hook` is a
`@cached_property`, the hook (and therefore the stale token) persists for the
entire duration of the trigger. EKS STS tokens expire after ~15 minutes, so any
pod monitored longer than that gets 401 Unauthorized.
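The interaction between the guard and the cached hook can be illustrated with a minimal, self-contained simulation. `FakeHook` and `FakeTrigger` are hypothetical stand-ins, not the provider's classes, and the "exec plugin" is just a counter. The sketch only shows that repeated `get_conn()` calls through a `@cached_property` hook invoke the plugin once and keep returning the same token:

```python
import asyncio
from functools import cached_property


class FakeHook:
    """Hypothetical stand-in for AsyncKubernetesHook (not the provider's code)."""

    def __init__(self):
        self._config_loaded = False
        self.exec_plugin_calls = 0
        self.token = None

    async def _load_config(self):
        if self._config_loaded:  # the caching guard described above
            return
        # Stands in for running the exec plugin (e.g. `aws eks get-token`).
        self.exec_plugin_calls += 1
        self.token = f"token-{self.exec_plugin_calls}"
        self._config_loaded = True

    async def get_conn(self):
        await self._load_config()
        return self.token


class FakeTrigger:
    """Hypothetical stand-in for KubernetesPodTrigger."""

    @cached_property
    def hook(self):
        # Created once, then reused for the lifetime of the trigger.
        return FakeHook()


async def poll(trigger, times):
    # Each poll goes through the same cached hook instance.
    return [await trigger.hook.get_conn() for _ in range(times)]


trigger = FakeTrigger()
tokens = asyncio.run(poll(trigger, 5))
print(tokens, trigger.hook.exec_plugin_calls)
```

Every poll sees the token issued on the first call, which matches the behavior reported here once the real token has expired.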
**Error output:**
```
kubernetes_asyncio.client.exceptions.ApiException: (401)
Reason: Unauthorized
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},
"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
```
**Stack trace (from triggerer):**
```
  File "airflow/providers/cncf/kubernetes/triggers/pod.py", line 318, in _get_pod
    pod = await self.hook.get_pod(name=self.pod_name, namespace=self.pod_namespace)
  File "airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 948, in get_pod
    pod: V1Pod = await v1_api.read_namespaced_pod(
```
The `@tenacity.retry` on `_get_pod()` (3 attempts) and `@generic_api_retry`
on `get_pod()` do not help because every retry reuses the same cached hook with
the same expired token.
### What you think should happen instead
`_load_config()` should support exec-based auth that requires periodic token
refresh. The `_config_loaded` optimization is valid for static credentials
(bearer tokens, certificates, in-cluster service accounts) but breaks
exec-based credential plugins that produce short-lived tokens.
Possible approaches:
1. **Track the exec token's expiration and reload when needed.** When
`load_kube_config_from_dict()` processes an exec plugin, the response includes
an `expirationTimestamp`. The hook could store this and reset `_config_loaded`
when approaching expiry.
2. **Reset `_config_loaded` periodically.** A simpler approach: reset the
flag on a configurable interval (e.g., 10 minutes) so that exec plugins are
re-invoked before a typical token lifetime elapses.
3. **Don't cache when config uses exec-based auth.** After loading the
config, check if the user auth uses an exec plugin. If so, skip setting
`_config_loaded = True` so it reloads on each `get_conn()` call (restoring the
pre-10.12.0 behavior for exec-based configs).
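
As a rough illustration of approach 1, the sketch below keeps the caching guard but also remembers when the credentials expire and reloads shortly before that. Everything here is hypothetical: `ExpiryAwareLoader`, the `fetch_credentials` callable, and the `refresh_margin` parameter are illustrative names, not part of the provider or of `kubernetes_asyncio`. The fetch callable stands in for whatever would run the exec plugin and return a token together with its expiry (or `None` for static credentials). A fake clock makes the demo deterministic:

```python
import asyncio
from datetime import datetime, timedelta, timezone


class ExpiryAwareLoader:
    """Hypothetical sketch of an expiry-aware config cache (not the provider's API)."""

    def __init__(self, fetch_credentials, refresh_margin=timedelta(minutes=1), now=None):
        self._fetch = fetch_credentials  # async callable -> (token, expires_at or None)
        self._margin = refresh_margin
        self._now = now or (lambda: datetime.now(timezone.utc))
        self._config_loaded = False
        self._expires_at = None
        self.token = None

    def _needs_reload(self):
        if not self._config_loaded:
            return True
        if self._expires_at is None:
            # Static credentials: keep the cached config indefinitely.
            return False
        # Reload once the token is within the safety margin of its expiry.
        return self._now() >= self._expires_at - self._margin

    async def load_config(self):
        if not self._needs_reload():
            return
        self.token, self._expires_at = await self._fetch()
        self._config_loaded = True


class FakeClock:
    """Manually advanced clock so the demo does not depend on wall time."""

    def __init__(self):
        self.t = datetime(2024, 1, 1, tzinfo=timezone.utc)

    def __call__(self):
        return self.t

    def advance(self, delta):
        self.t += delta


clock = FakeClock()
calls = []


async def fake_exec_plugin():
    # Stands in for the exec plugin: a new token valid for 15 minutes.
    calls.append(clock())
    return f"token-{len(calls)}", clock() + timedelta(minutes=15)


async def demo():
    loader = ExpiryAwareLoader(fake_exec_plugin, now=clock)
    seen = []
    for _ in range(4):  # polls at t = 0, 5, 10 and 15 minutes
        await loader.load_config()
        seen.append(loader.token)
        clock.advance(timedelta(minutes=5))
    return seen


seen = asyncio.run(demo())
print(seen, len(calls))
```

With a one-minute margin, the first three polls reuse the initial token and the poll at 15 minutes triggers a reload. Returning `None` as the expiry keeps the current once-per-instance behavior for static credentials.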
### How to reproduce
1. Configure a `KubernetesPodOperator` (or `EksPodOperator`) with
`deferrable=True` connecting to a cluster that uses exec-based kubeconfig auth
(e.g., EKS with `aws eks get-token`)
2. Use `apache-airflow-providers-cncf-kubernetes>=10.12.0`
3. Run a pod that takes longer than the exec token's lifetime (~15 minutes
for EKS)
4. Observe 401 Unauthorized after the token expires
To verify the regression, downgrade to
`apache-airflow-providers-cncf-kubernetes==10.11.0` — the same DAG will succeed.
### Anything else
**Affected authentication methods:** Any exec-based credential plugin that
produces short-lived tokens. This includes:
- AWS EKS (`aws eks get-token`) — tokens expire in ~15 minutes
- GKE with `gke-gcloud-auth-plugin` — tokens expire in ~60 minutes
- Any custom exec plugin with token expiration
**Not affected:** Static bearer tokens, client certificates, in-cluster
service account tokens
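
One way to tell affected from unaffected configurations (and a possible building block for approach 3) is to check whether the user referenced by the kubeconfig's current context has an `exec` entry. The helper below is a minimal sketch that works on a plain kubeconfig dictionary. `uses_exec_auth` is a hypothetical name, and the two sample configs are made up for illustration:

```python
def uses_exec_auth(kubeconfig: dict) -> bool:
    """Return True if the current context's user authenticates via an exec plugin."""
    current = kubeconfig.get("current-context")
    # Index contexts and users by name, tolerating missing or empty sections.
    contexts = {c["name"]: c.get("context") or {} for c in kubeconfig.get("contexts") or []}
    users = {u["name"]: u.get("user") or {} for u in kubeconfig.get("users") or []}
    user_name = contexts.get(current, {}).get("user")
    return "exec" in users.get(user_name, {})


# Made-up example: EKS-style exec authentication.
eks_config = {
    "current-context": "eks",
    "contexts": [{"name": "eks", "context": {"cluster": "c1", "user": "eks-user"}}],
    "users": [
        {
            "name": "eks-user",
            "user": {
                "exec": {
                    "apiVersion": "client.authentication.k8s.io/v1beta1",
                    "command": "aws",
                    "args": ["eks", "get-token", "--cluster-name", "example"],
                }
            },
        }
    ],
}

# Made-up example: static bearer token.
static_config = {
    "current-context": "static",
    "contexts": [{"name": "static", "context": {"cluster": "c1", "user": "sa"}}],
    "users": [{"name": "sa", "user": {"token": "example-token"}}],
}

print(uses_exec_auth(eks_config), uses_exec_auth(static_config))
```

A check like this would let the hook skip caching only for exec-based users while keeping the optimization for static credentials.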