Vamsi-klu opened a new pull request, #61936:
URL: https://github.com/apache/airflow/pull/61936

   ## Why this change
   
   Issue #60943 reports intermittent KubernetesPodOperator task failures on 
Celery workers when multiple tasks start at the same time and the kubeconfig 
uses `aws eks get-token` exec auth.
   
   The failure mode is subtle:
   - the auth subprocess (`aws eks get-token`) can fail because of a race 
condition in older botocore versions around `~/.aws/cli/cache`
   - the Kubernetes client then proceeds with invalid or empty credentials and 
surfaces a generic `403 Forbidden`
   - this looks identical to a real RBAC failure, so operators often lose time 
debugging the wrong problem
   
   This PR adds **explicit runtime guardrails** for that path so operators get 
a clear signal before task execution fails in a misleading way.
   
   ## Impact of the change
   
   This adds a policy-driven runtime check only when kubeconfig exec auth 
actually uses `aws eks get-token`:
   
   - `warn` (default): emits an actionable warning if botocore is affected 
(`< 1.40.2`) or its version cannot be detected
   - `fail`: hard-fails early with a clear error to enforce platform policy
   - `ignore`: bypasses the check when users intentionally manage this 
externally
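
The three modes can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `check_botocore`, `ExecAuthVersionError`, and the message text are hypothetical names, and treating an unrecognized mode as `warn` is an assumption based on `warn` being the default.

```python
import warnings

MIN_SAFE_BOTOCORE = (1, 40, 2)  # threshold quoted in this PR description
VALID_MODES = {"warn", "fail", "ignore"}


class ExecAuthVersionError(RuntimeError):
    """Hypothetical error type raised in `fail` mode."""


def check_botocore(mode, botocore_version):
    """Apply the policy to a parsed version tuple (or None if undetected).

    Returns True when the check passes or is skipped, False after a warning.
    """
    if mode not in VALID_MODES:
        mode = "warn"  # assumption: invalid values fall back to the default
    if mode == "ignore":
        return True
    if botocore_version is not None and botocore_version >= MIN_SAFE_BOTOCORE:
        return True

    if botocore_version is None:
        detail = "the botocore version could not be detected"
    else:
        found = ".".join(str(part) for part in botocore_version)
        detail = f"botocore {found} is older than 1.40.2"
    message = (
        "kubeconfig exec auth uses `aws eks get-token`, but "
        f"{detail}; a later 403 may be an auth failure rather than RBAC"
    )
    if mode == "fail":
        raise ExecAuthVersionError(message)
    warnings.warn(message, stacklevel=2)
    return False
```

Comparing version tuples rather than strings avoids ordering bugs such as `"1.9.0" > "1.40.2"`.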
   
   Operational impact:
   - **Improves diagnosability** of a production issue that often appears as 
an ambiguous `403`
   - **Reduces MTTR** by surfacing root-cause guidance at connection/auth setup 
time
   - **Adds governance controls** for teams that need strict enforcement 
(`fail`) without forcing everyone into that mode
   - **Keeps backwards compatibility** with default `warn`
   
   ## Scope and non-goals
   
   - Scope is intentionally limited to the AWS EKS exec-auth path 
(`aws eks get-token`) because this is the concrete failing path in #60943.
   - This PR does **not** change Kubernetes retry semantics for `403` 
responses, and does not change auth flow for non-AWS exec plugins.
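
To make the scope concrete, here is a rough sketch of how the `aws eks get-token` path could be recognized in an already-parsed kubeconfig. The function name `uses_aws_eks_get_token` and the sample values are illustrative assumptions; the actual detection logic in this PR may differ, for example by inspecting only the user of the active context.

```python
import os


def uses_aws_eks_get_token(kubeconfig: dict) -> bool:
    """Return True if any kubeconfig user runs `aws ... eks get-token` as exec auth."""
    for user_entry in kubeconfig.get("users") or []:
        exec_cfg = (user_entry.get("user") or {}).get("exec") or {}
        command = str(exec_cfg.get("command") or "")
        args = [str(arg) for arg in exec_cfg.get("args") or []]
        # Compare the basename so an absolute path such as /usr/local/bin/aws matches.
        is_aws_cli = os.path.basename(command) == "aws"
        # Global flags (e.g. --region) may precede the subcommand, so look for
        # "eks" immediately followed by "get-token" anywhere in the args.
        has_subcommand = any(
            first == "eks" and second == "get-token"
            for first, second in zip(args, args[1:])
        )
        if is_aws_cli and has_subcommand:
            return True
    return False


# Illustrative kubeconfig fragment (cluster name and region are made up).
sample_kubeconfig = {
    "users": [
        {
            "name": "demo-user",
            "user": {
                "exec": {
                    "apiVersion": "client.authentication.k8s.io/v1beta1",
                    "command": "aws",
                    "args": ["--region", "us-east-1", "eks", "get-token",
                             "--cluster-name", "demo"],
                }
            },
        }
    ]
}
```

Other exec plugins do not match, which mirrors the non-goal stated above.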
   
   ## Configuration
   
   New Kubernetes connection extra:
   - `exec_auth_aws_cli_version_check_mode`: `warn` (default) | `fail` | 
`ignore`
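
As an illustration, the extra could be supplied as part of the connection's JSON `extra` field. The snippet below only builds and reads back that JSON. Only the key name comes from this PR; reading a missing key as `warn` mirrors the documented default.

```python
import json

# Opt in to strict enforcement for this connection.
extra = {"exec_auth_aws_cli_version_check_mode": "fail"}
extra_json = json.dumps(extra)

# Reading the setting back, defaulting to "warn" when the key is absent.
mode = json.loads(extra_json).get("exec_auth_aws_cli_version_check_mode", "warn")
default_mode = json.loads("{}").get("exec_auth_aws_cli_version_check_mode", "warn")
```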
   
   ## Validation
   
   - Added unit coverage for:
     - kubeconfig exec-auth detection (`aws eks get-token`)
     - botocore version parsing from `aws --version`
     - mode behavior (`warn`, `fail`, `ignore`, and the fallback for invalid values)
     - integration points in `get_conn` and default kubeconfig client path
   - Test command used:
     - `AIRFLOW_HOME=/tmp/airflow-60943 uv run --python 3.12 -m pytest providers/cncf/kubernetes/tests/unit/cncf/kubernetes/hooks/test_kubernetes.py -q`
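
For readers unfamiliar with the version-parsing case, a minimal sketch of extracting the botocore version from `aws --version` output could look like this. `parse_botocore_version` is a hypothetical helper, and the sample strings are illustrative rather than captured from a real run. Output that carries no `botocore/` token yields `None`, which corresponds to the "version cannot be detected" branch.

```python
import re

# Matches a token such as "botocore/1.34.0" anywhere in the output.
_BOTOCORE_RE = re.compile(r"\bbotocore/(\d+)\.(\d+)\.(\d+)")


def parse_botocore_version(aws_version_output: str):
    """Return the botocore version as an (int, int, int) tuple, or None."""
    match = _BOTOCORE_RE.search(aws_version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


# Illustrative outputs: one with a botocore token, one without.
with_botocore = "aws-cli/1.32.0 Python/3.11.6 Linux/5.15.0 botocore/1.34.0"
without_botocore = "aws-cli/2.15.0 Python/3.11.6 Linux/5.15.0 exe/x86_64"

parsed = parse_botocore_version(with_botocore)        # (1, 34, 0)
undetected = parse_botocore_version(without_botocore)  # None
```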
   
   closes #60943
   