dungpham91 opened a new pull request, #68265: URL: https://github.com/apache/airflow/pull/68265
## What this PR does

Adds a `checksum/fernet-key` pod annotation to the scheduler, api-server, triggerer, and dag-processor deployment templates. This ensures that all Airflow component pods automatically perform a rolling restart whenever the Fernet key secret changes.

## Problem Statement

When using the `KubernetesExecutor`, worker pods are created ephemerally for each task. These worker pods mount the Fernet key secret directly from Kubernetes at startup. However, the long-running Airflow components (scheduler, api-server, triggerer, dag-processor) load the Fernet key into memory at boot and never re-read it.

If the Fernet key secret is rotated (e.g., the Helm `pre-install` hook regenerates it during a sync/upgrade), a **key mismatch** occurs:

- Long-running pods still hold the **old** Fernet key in memory.
- Newly spawned worker pods mount the **new** Fernet key from the Kubernetes secret.
- Communication between workers and the API server (Execution API in Airflow 3.x) fails because tokens encrypted with the new key cannot be decrypted by components still using the old key.

## Symptoms Observed

- Worker pods crash with `exit_code: 1` within seconds of starting.
- No task logs are written to remote storage because the worker crashes before the log handler is initialized.
- The Airflow UI shows: `Could not read served logs: Invalid URL 'http://:8793/log/...' No host supplied` (a secondary symptom: the worker pod is already deleted).
- Scheduler logs report: `Pod phase: Failed, container_state: terminated, container_reason: Error, exit_code: 1` with no further detail.
- The DAG processor reports zero parsing errors, so the DAG code itself is healthy.

## Root Cause Analysis

Investigation on a production cluster revealed the following:

### Timeline of the incident

```
2026-06-08 00:00 - Airflow pods start, loading the current Fernet key into memory.
2026-06-09 00:03 - ArgoCD syncs the Helm chart → the pre-install hook deletes and recreates
                   the fernet-key secret with a NEW random value.
2026-06-09 01:30 - Scheduler dispatches a task → creates a worker pod.
                   Worker pod mounts the NEW fernet key from K8s secret.
                   Worker fails to authenticate with API server (still using OLD key).
                   Worker crashes with exit_code: 1 after ~24 seconds.
```

### Evidence collected

**1. Fernet key secret was recently recreated:**

```
$ kubectl get secret airflow-fernet-key -n airflow -o json
{
  "metadata": {
    "creationTimestamp": "2026-06-09T00:03:34Z",   ← recreated after pods started
    "annotations": {
      "helm.sh/hook": "pre-install",
      "helm.sh/hook-delete-policy": "before-hook-creation"
    }
  }
}
```

**2. Airflow pods were NOT restarted after the secret change:**

```
$ kubectl get pods -n airflow -l release=airflow -o custom-columns="NAME:.metadata.name,STARTED:.status.startTime"
NAME                                     STARTED
airflow-api-server-55cc57dfdc-96drz      2026-06-08T00:00:51Z   ← started BEFORE secret rotation
airflow-dag-processor-745444d959-nkc99   2026-06-08T00:00:52Z
airflow-scheduler-bdbc8f69f-p8q8w        2026-06-08T00:00:52Z
airflow-triggerer-858d446677-h847v       2026-06-08T00:00:52Z
```

**3. Worker pods consistently crash:**

```
$ kubectl get events -n airflow --field-selector reason=BackOff
LAST SEEN   TYPE      REASON    OBJECT                                MESSAGE
2m          Warning   BackOff   pod/daily-pipeline-trigger-lgt8wbd0   Back-off restarting failed container

$ kubectl logs airflow-scheduler -c scheduler --tail=20
[warning] Task trigger_downstream.2 failed in pod airflow/daily-pipeline-trigger-u7hq5f9c. Pod phase: Failed, reason: None, container_state: terminated, container_reason: Error, exit_code: 1
[info] Deleting pod daily-pipeline-trigger-u7hq5f9c
```

**4. No logs on remote storage for today's runs** (worker crashes before log handler init):

```
# Checking remote log storage for today's DAG run:
Found 0 log files   ← no logs uploaded (worker crashed too fast)

# Checking remote log storage for yesterday's DAG run:
Found 6 log files   ← yesterday's logs exist (before secret rotation)
```

**5. DAG code is healthy** (DAG processor reports 0 errors):

```
DAG File Processing Stats

File Path                        # DAGs   # Errors   Last Duration
dags/dag_daily_pipeline.py       3        0          0.09s
dags/common/dag_dlt_factory.py   6        0          0.99s
```

### Conclusion

The Fernet key secret is regenerated by the Helm `pre-install` hook (using `randAlphaNum 32`) but **no mechanism exists to restart the long-running pods** that cached the old key. Worker pods mount the new key at creation time, creating a mismatch that causes immediate authentication failure against the Execution API server.

## Fix

Add `checksum/fernet-key` to the pod template annotations in all four deployment templates:

```yaml
checksum/fernet-key: {{ include (print $.Template.BasePath "/secrets/fernetkey-secret.yaml") . | sha256sum }}
```

This follows the same pattern already used for `checksum/metadata-secret`, `checksum/pgbouncer-config-secret`, and other secrets in these templates. When the rendered content of `fernetkey-secret.yaml` changes (i.e., the key is rotated), the annotation value changes, which triggers a rolling restart of all affected deployments.

## Files Changed

- `chart/templates/scheduler/scheduler-deployment.yaml`
- `chart/templates/api-server/api-server-deployment.yaml`
- `chart/templates/triggerer/triggerer-deployment.yaml`
- `chart/templates/dag-processor/dag-processor-deployment.yaml`

## Notes

- This change is consistent with the existing annotation pattern used for `metadata-secret`, `result-backend-secret`, and `jwt-secret`.
- The `worker-deployment.yaml` (CeleryExecutor workers) already has similar checksum annotations; KubernetesExecutor worker pods are ephemeral and always mount the current secret, so they do not need this annotation.
- Users who provide their own Fernet key secret via `fernetKeySecretName` are unaffected: the template renders as empty when that value is set, so the checksum remains stable.
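
To illustrate why a checksum annotation leads to a restart, here is a minimal Python sketch of the idea, not the chart's actual rendering code. It assumes the annotation value is simply the hex SHA-256 digest of the rendered secret template text, which is what Helm's `sha256sum` produces. The `render_secret` helper and the key values are hypothetical placeholders. Kubernetes rolls a Deployment whenever its pod template changes, so a different annotation value is enough to trigger a rollout.

```python
import hashlib


def checksum_annotation(rendered_manifest: str) -> str:
    """Hex SHA-256 digest of the rendered template text (mirrors Helm's sha256sum)."""
    return hashlib.sha256(rendered_manifest.encode("utf-8")).hexdigest()


def render_secret(fernet_key_b64: str) -> str:
    """Hypothetical stand-in for the rendered fernetkey-secret.yaml (assumption)."""
    return (
        "apiVersion: v1\n"
        "kind: Secret\n"
        "metadata:\n"
        "  name: airflow-fernet-key\n"
        "data:\n"
        f"  fernet-key: {fernet_key_b64}\n"
    )


def pod_template_annotations(rendered_manifest: str) -> dict[str, str]:
    """Annotations that would be placed on the Deployment's pod template."""
    return {"checksum/fernet-key": checksum_annotation(rendered_manifest)}


# Placeholder base64 values, not real Fernet keys.
before = pod_template_annotations(render_secret("b2xkLWtleQ=="))
after_rotation = pod_template_annotations(render_secret("bmV3LWtleQ=="))
unchanged = pod_template_annotations(render_secret("b2xkLWtleQ=="))

# A rotated key changes the annotation, so the pod template differs and a rollout follows.
needs_rollout = before != after_rotation
# Identical rendered content yields the same annotation, so no rollout is triggered.
no_rollout = before == unchanged

# An empty render (as described for `fernetKeySecretName`) always hashes to the same value.
empty_is_stable = checksum_annotation("") == checksum_annotation("")

print(needs_rollout, no_rollout, empty_is_stable)
```

The comparison is on the pod template annotations only, because that is the part of the Deployment spec whose change causes pods to be replaced.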
