andrewhharmon opened a new issue, #61737:
URL: https://github.com/apache/airflow/issues/61737

   ### Apache Airflow Provider(s)
   
   cncf-kubernetes
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-cncf-kubernetes==10.12.3 (regression introduced in 
10.12.0)
   Working in: apache-airflow-providers-cncf-kubernetes==10.11.0
   
   ### Apache Airflow version
   
   3.0.0 (also affects 2.x with the affected provider version)
   
   ### Operating System
   
   Debian/Ubuntu-based containers (Astronomer Runtime)
   
   ### Deployment
   
   Astronomer
   
   ### Deployment details
   
   Triggerer runs on a separate host from the worker. EKS cluster 
authentication uses exec-based kubeconfig (`aws eks get-token`), where the exec 
command must be re-invoked periodically to obtain fresh short-lived tokens.
   
   ### What happened
   
   `KubernetesPodTrigger` fails with 401 Unauthorized after ~15 minutes when 
using exec-based kubeconfig authentication (e.g., EKS clusters with `aws eks 
get-token`).
   
   **Root cause:** In version 10.12.0, a `_config_loaded` caching guard was 
added to `AsyncKubernetesHook._load_config()`:
   
   ```python
   async def _load_config(self):
       """Load Kubernetes configuration once per hook instance."""
       if self._config_loaded:    # <-- new in 10.12.0
           return
       # ... load config, execute exec plugin, get token ...
       self._config_loaded = True
   ```
   
   In previous versions (10.11.x and earlier), `_load_config()` ran on every 
`get_conn()` call. This meant the exec plugin (e.g., `aws eks get-token`) was 
re-invoked on each poll, always producing a fresh token.
   
   With the `_config_loaded` guard, the exec plugin runs **once** for the 
lifetime of the hook instance. Since `KubernetesPodTrigger.hook` is a 
`@cached_property`, the hook (and therefore the stale token) persists for the 
entire duration of the trigger. EKS STS tokens expire after ~15 minutes, so any 
pod monitored longer than that gets 401 Unauthorized.
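
The interaction between the load-once guard and the cached hook can be illustrated with a small simulation. This is only a sketch: `FakeExecPlugin`, `GuardedHook`, and `FakeTrigger` are hypothetical stand-ins, not the provider's classes, and the fake plugin simply counts how often it is invoked.

```python
import asyncio
from functools import cached_property


class FakeExecPlugin:
    """Hypothetical stand-in for an exec plugin such as `aws eks get-token`."""

    def __init__(self):
        self.calls = 0

    def get_token(self) -> str:
        self.calls += 1
        return f"token-{self.calls}"


class GuardedHook:
    """Hypothetical hook that mimics the load-once guard described above."""

    def __init__(self, plugin):
        self._plugin = plugin
        self._config_loaded = False
        self.token = None

    async def _load_config(self):
        if self._config_loaded:
            return  # the plugin is never asked for a new token again
        self.token = self._plugin.get_token()
        self._config_loaded = True

    async def get_conn(self):
        await self._load_config()
        return self.token


class FakeTrigger:
    """Hypothetical trigger that caches its hook, like a cached_property hook."""

    def __init__(self, plugin):
        self._plugin = plugin

    @cached_property
    def hook(self):
        return GuardedHook(self._plugin)


async def poll(trigger, times):
    # Each poll goes through get_conn(), as each trigger poll would.
    return [await trigger.hook.get_conn() for _ in range(times)]


plugin = FakeExecPlugin()
tokens = asyncio.run(poll(FakeTrigger(plugin), 3))
print(tokens, plugin.calls)  # ['token-1', 'token-1', 'token-1'] 1
```

Every poll sees the token from the first load, which is why a real token that expires mid-run is never replaced.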
   
   **Error output:**
   ```
   kubernetes_asyncio.client.exceptions.ApiException: (401)
   Reason: Unauthorized
   HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
   ```
   
   **Stack trace (from triggerer):**
   ```
   File "airflow/providers/cncf/kubernetes/triggers/pod.py", line 318, in _get_pod
       pod = await self.hook.get_pod(name=self.pod_name, namespace=self.pod_namespace)
   File "airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 948, in get_pod
       pod: V1Pod = await v1_api.read_namespaced_pod(
   
   The `@tenacity.retry` on `_get_pod()` (3 attempts) and `@generic_api_retry` 
on `get_pod()` do not help because every retry reuses the same cached hook with 
the same expired token.
   
   ### What you think should happen instead
   
   `_load_config()` should support exec-based auth that requires periodic token 
refresh. The `_config_loaded` optimization is valid for static credentials 
(bearer tokens, certificates, in-cluster service accounts) but breaks 
exec-based credential plugins that produce short-lived tokens.
   
   Possible approaches:
   
   1. **Track the exec token's expiration and reload when needed.** When 
`load_kube_config_from_dict()` processes an exec plugin, the response includes 
an `expirationTimestamp`. The hook could store this and reset `_config_loaded` 
when approaching expiry.
   
   2. **Reset `_config_loaded` periodically.** A simpler approach — reset the 
flag on a configurable interval (e.g., 10 minutes) so that exec plugins are 
re-invoked before typical token lifetimes expire.
   
   3. **Don't cache when config uses exec-based auth.** After loading the 
config, check if the user auth uses an exec plugin. If so, skip setting 
`_config_loaded = True` so it reloads on each `get_conn()` call (restoring the 
pre-10.12.0 behavior for exec-based configs).
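
Approaches 1 and 2 could share one freshness check that replaces the boolean flag. The sketch below is an assumption about how that might look, not the provider's code: `ConfigFreshness`, `refresh_margin`, and `max_age` are hypothetical names, and the 2-minute and 10-minute defaults are illustrative. It prefers the exec plugin's `expirationTimestamp` when one is known and falls back to a fixed interval otherwise.

```python
from datetime import datetime, timedelta, timezone


def parse_expiration(timestamp: str) -> datetime:
    """Parse an RFC 3339 timestamp such as 2024-01-01T00:15:00Z."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


class ConfigFreshness:
    """Hypothetical helper that decides when a cached config must be reloaded."""

    def __init__(self, refresh_margin=timedelta(minutes=2), max_age=timedelta(minutes=10)):
        self.refresh_margin = refresh_margin  # reload this long before expiry
        self.max_age = max_age  # fallback interval when no expiry is known
        self.loaded_at = None
        self.expires_at = None

    def mark_loaded(self, now, expiration_timestamp=None):
        self.loaded_at = now
        self.expires_at = (
            parse_expiration(expiration_timestamp) if expiration_timestamp else None
        )

    def needs_reload(self, now) -> bool:
        if self.loaded_at is None:
            return True  # nothing loaded yet
        if self.expires_at is not None:
            # Approach 1: reload shortly before the token expires.
            return now >= self.expires_at - self.refresh_margin
        # Approach 2: no expiry known, so reload on a fixed interval.
        return now - self.loaded_at >= self.max_age


start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
freshness = ConfigFreshness()
freshness.mark_loaded(start, "2024-01-01T00:15:00Z")
print(freshness.needs_reload(start + timedelta(minutes=5)))   # False
print(freshness.needs_reload(start + timedelta(minutes=13)))  # True
```

Under this sketch, `_load_config()` would return early only while `needs_reload()` is false, and would call `mark_loaded()` after each successful load.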
   
   ### How to reproduce
   
   1. Configure a `KubernetesPodOperator` (or `EksPodOperator`) with 
`deferrable=True` connecting to a cluster that uses exec-based kubeconfig auth 
(e.g., EKS with `aws eks get-token`)
   2. Use `apache-airflow-providers-cncf-kubernetes>=10.12.0`
   3. Run a pod that takes longer than the exec token's lifetime (~15 minutes 
for EKS)
   4. Observe 401 Unauthorized after the token expires
   
   To verify the regression, downgrade to 
`apache-airflow-providers-cncf-kubernetes==10.11.0` — the same DAG will succeed.
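
To check which side of the regression an environment is on, the installed provider version can be compared against 10.12.0. This is a minimal sketch: `release_tuple`, `is_affected`, and `installed_provider_version` are hypothetical helper names, and the check assumes every release from 10.12.0 onward is affected, since this report names no fixed version.

```python
import re
from importlib.metadata import PackageNotFoundError, version

PACKAGE = "apache-airflow-providers-cncf-kubernetes"
FIRST_AFFECTED = (10, 12, 0)  # per this report; no upper bound is assumed


def release_tuple(version_string: str) -> tuple:
    """Turn '10.12.3' (or '10.12.0rc1') into a comparable (major, minor, patch)."""
    numbers = []
    for piece in version_string.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    numbers += [0] * (3 - len(numbers))  # pad short versions such as '10.12'
    return tuple(numbers)


def is_affected(version_string: str) -> bool:
    return release_tuple(version_string) >= FIRST_AFFECTED


def installed_provider_version():
    """Return the installed provider version, or None if it is not installed."""
    try:
        return version(PACKAGE)
    except PackageNotFoundError:
        return None


print(is_affected("10.12.3"))  # True
print(is_affected("10.11.0"))  # False
```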
   
   ### Anything else
   
   **Affected authentication methods:** Any exec-based credential plugin that 
produces short-lived tokens. This includes:
   - AWS EKS (`aws eks get-token`) — tokens expire in ~15 minutes
   - GKE with `gke-gcloud-auth-plugin` — tokens expire in ~60 minutes
   - Any custom exec plugin with token expiration
   
   **Not affected:** Static bearer tokens, client certificates, in-cluster 
service account tokens (which are auto-rotated by the kubelet).
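
Whether a given kubeconfig falls into the affected group can be checked by looking for an `exec` section on the current context's user, which is also the check approach 3 would need. The helper below is a sketch over a kubeconfig already parsed into a dict; `uses_exec_auth` is a hypothetical name, and `example-cluster` and `static-token` are placeholder values.

```python
def uses_exec_auth(kubeconfig: dict) -> bool:
    """Return True if the current context's user authenticates via an exec plugin."""
    current = kubeconfig.get("current-context")
    contexts = {
        c["name"]: c.get("context") or {} for c in kubeconfig.get("contexts") or []
    }
    users = {u["name"]: u.get("user") or {} for u in kubeconfig.get("users") or []}
    user_name = contexts.get(current, {}).get("user")
    return "exec" in users.get(user_name, {})


eks_config = {
    "current-context": "eks",
    "contexts": [{"name": "eks", "context": {"cluster": "eks", "user": "eks-user"}}],
    "users": [
        {
            "name": "eks-user",
            "user": {
                "exec": {
                    "apiVersion": "client.authentication.k8s.io/v1beta1",
                    "command": "aws",
                    "args": ["eks", "get-token", "--cluster-name", "example-cluster"],
                }
            },
        }
    ],
}

static_config = {
    "current-context": "static",
    "contexts": [{"name": "static", "context": {"cluster": "c", "user": "static-user"}}],
    "users": [{"name": "static-user", "user": {"token": "static-token"}}],
}

print(uses_exec_auth(eks_config))     # True
print(uses_exec_auth(static_config))  # False
```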
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.