dungpham91 opened a new pull request, #68265:
URL: https://github.com/apache/airflow/pull/68265

   ## What this PR does
   
   Adds a `checksum/fernet-key` pod annotation to the scheduler, api-server, 
triggerer, and dag-processor deployment templates. With the annotation in 
place, these long-running component pods go through a rolling restart whenever 
the Fernet key secret changes.
   
   ## Problem Statement
   
   When using the `KubernetesExecutor`, worker pods are created ephemerally for 
each task. These worker pods mount the Fernet key secret directly from 
Kubernetes at startup. However, the long-running Airflow components (scheduler, 
api-server, triggerer, dag-processor) load the Fernet key into memory at boot 
and never re-read it.
   
   If the Fernet key secret is rotated (e.g., the Helm `pre-install` hook 
regenerates it during a sync/upgrade), a **key mismatch** occurs:
   
   - Long-running pods still hold the **old** Fernet key in memory.
   - Newly spawned worker pods mount the **new** Fernet key from the Kubernetes 
secret.
   - Communication between workers and the API server (Execution API in Airflow 
3.x) fails because tokens encrypted with the new key cannot be decrypted by 
components still using the old key.
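
The general failure mode can be reproduced in isolation. The sketch below assumes the `cryptography` package is installed and uses two freshly generated keys as hypothetical stand-ins for the old and new secret values. It only shows that data encrypted under one Fernet key cannot be decrypted under another; it does not reproduce the Execution API code path.

```python
from cryptography.fernet import Fernet, InvalidToken

# Hypothetical stand-ins: the key cached by a long-running pod and the
# rotated key mounted by a newly created worker pod.
old_key = Fernet.generate_key()
new_key = Fernet.generate_key()


def can_decrypt(key: bytes, token: bytes) -> bool:
    """Return True if `token` decrypts under `key`, False on a key mismatch."""
    try:
        Fernet(key).decrypt(token)
        return True
    except InvalidToken:
        return False


# A payload written by a component that holds the new key.
token = Fernet(new_key).encrypt(b"payload")

same_key_ok = can_decrypt(new_key, token)  # the writing key can read it back
old_key_ok = can_decrypt(old_key, token)   # the stale key cannot
```

Restarting the long-running pods makes every component load the same key again, which is what the annotation in this PR is meant to automate.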
   
   ## Symptoms Observed
   
   - Worker pods crash with `exit_code: 1` within seconds of starting.
   - No task logs are written to remote storage because the worker crashes 
before the log handler is initialized.
   - The Airflow UI shows: `Could not read served logs: Invalid URL 
'http://:8793/log/...' No host supplied` (a secondary symptom — the worker pod 
is already deleted).
   - Scheduler logs report: `Pod phase: Failed, container_state: terminated, 
container_reason: Error, exit_code: 1` with no further detail.
   - The DAG processor reports zero parsing errors — DAG code itself is healthy.
   
   ## Root Cause Analysis
   
   Investigation on a production cluster revealed the following:
   
   ### Timeline of the incident
   
   ```
   2026-06-08 00:00 - Airflow pods start, loading the current Fernet key into memory.
   2026-06-09 00:03 - ArgoCD syncs the Helm chart → the pre-install hook deletes
                      and recreates the fernet-key secret with a NEW random value.
   2026-06-09 01:30 - Scheduler dispatches a task → creates a worker pod.
                      Worker pod mounts the NEW fernet key from K8s secret.
                      Worker fails to authenticate with API server (still using OLD key).
                      Worker crashes with exit_code: 1 after ~24 seconds.
   ```
   
   ### Evidence collected
   
   **1. Fernet key secret was recently recreated:**
   
   ```
   $ kubectl get secret airflow-fernet-key -n airflow -o json
   {
     "metadata": {
       "creationTimestamp": "2026-06-09T00:03:34Z",   ← recreated after pods started
       "annotations": {
         "helm.sh/hook": "pre-install",
         "helm.sh/hook-delete-policy": "before-hook-creation"
       }
     }
   }
   ```
   
   **2. Airflow pods were NOT restarted after the secret change:**
   
   ```
   $ kubectl get pods -n airflow -l release=airflow -o custom-columns="NAME:.metadata.name,STARTED:.status.startTime"
   NAME                                    STARTED
   airflow-api-server-55cc57dfdc-96drz     2026-06-08T00:00:51Z   ← started BEFORE secret rotation
   airflow-dag-processor-745444d959-nkc99  2026-06-08T00:00:52Z
   airflow-scheduler-bdbc8f69f-p8q8w       2026-06-08T00:00:52Z
   airflow-triggerer-858d446677-h847v      2026-06-08T00:00:52Z
   ```
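
The comparison behind evidence 1 and 2 can be automated. The following is a minimal sketch, not part of the chart: the helper names `parse_k8s_time` and `stale_pods` are hypothetical, and the timestamps are copied from the `kubectl` output above rather than fetched from a cluster.

```python
from datetime import datetime


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2026-06-09T00:03:34Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def stale_pods(secret_created: str, pod_starts: dict) -> list:
    """Return the names of pods that started before the secret was recreated."""
    cutoff = parse_k8s_time(secret_created)
    return sorted(
        name
        for name, started in pod_starts.items()
        if parse_k8s_time(started) < cutoff
    )


# Start times taken from the pod listing above.
pods = {
    "airflow-api-server-55cc57dfdc-96drz": "2026-06-08T00:00:51Z",
    "airflow-dag-processor-745444d959-nkc99": "2026-06-08T00:00:52Z",
    "airflow-scheduler-bdbc8f69f-p8q8w": "2026-06-08T00:00:52Z",
    "airflow-triggerer-858d446677-h847v": "2026-06-08T00:00:52Z",
}

# Secret creationTimestamp taken from evidence 1.
stale = stale_pods("2026-06-09T00:03:34Z", pods)
```

Every pod returned by `stale_pods` predates the secret and may still hold the old key in memory.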
   
   **3. Worker pods consistently crash:**
   
   ```
   $ kubectl get events -n airflow --field-selector reason=BackOff
   LAST SEEN   TYPE      REASON    OBJECT                                MESSAGE
   2m          Warning   BackOff   pod/daily-pipeline-trigger-lgt8wbd0   Back-off restarting failed container
   
   $ kubectl logs deploy/airflow-scheduler -n airflow -c scheduler --tail=20
   [warning] Task trigger_downstream.2 failed in pod airflow/daily-pipeline-trigger-u7hq5f9c.
   Pod phase: Failed, reason: None, container_state: terminated, container_reason: Error, exit_code: 1
   [info] Deleting pod daily-pipeline-trigger-u7hq5f9c
   ```
   
   **4. No logs on remote storage for today's runs** (worker crashes before log 
handler init):
   
   ```
   # Checking remote log storage for today's DAG run:
   Found 0 log files   ← no logs uploaded (worker crashed too fast)
   
   # Checking remote log storage for yesterday's DAG run:
   Found 6 log files   ← yesterday's logs exist (before secret rotation)
   ```
   
   **5. DAG code is healthy** — DAG processor reports 0 errors:
   
   ```
   DAG File Processing Stats
   File Path                              # DAGs  # Errors  Last Duration
   dags/dag_daily_pipeline.py             3       0         0.09s
   dags/common/dag_dlt_factory.py         6       0         0.99s
   ```
   
   ### Conclusion
   
   The Fernet key secret is regenerated by the Helm `pre-install` hook (using 
`randAlphaNum 32`) but **no mechanism exists to restart the long-running pods** 
that cached the old key. Worker pods mount the new key at creation time, 
creating a mismatch that causes immediate authentication failure against the 
Execution API server.
   
   ## Fix
   
   Add `checksum/fernet-key` to the pod template annotations in all four 
deployment templates:
   
   ```yaml
   checksum/fernet-key: {{ include (print $.Template.BasePath "/secrets/fernetkey-secret.yaml") . | sha256sum }}
   ```
   
   This follows the same pattern already used for `checksum/metadata-secret`, 
`checksum/pgbouncer-config-secret`, and other secrets in these templates. When 
the rendered content of `fernetkey-secret.yaml` changes (i.e., the key is 
rotated), the annotation value changes, which triggers a rolling restart of all 
affected deployments.
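
The checksum mechanism can be illustrated outside Helm. This is a rough sketch that assumes the annotation value is the hex SHA-256 digest of the rendered template text; the function name `checksum_annotation` and the two manifests are hypothetical and are not the chart's actual rendered output.

```python
import hashlib


def checksum_annotation(rendered_manifest: str) -> str:
    """Return the hex SHA-256 digest of the rendered template text."""
    return hashlib.sha256(rendered_manifest.encode("utf-8")).hexdigest()


# Hypothetical rendered secret manifests; only the key value differs.
before = 'data:\n  fernet-key: "b2xkLWtleQ=="\n'
after = 'data:\n  fernet-key: "bmV3LWtleQ=="\n'

# A different key value yields a different annotation value.
changed = checksum_annotation(before) != checksum_annotation(after)

# Identical rendered content (including an empty render) yields a stable value.
stable = checksum_annotation(before) == checksum_annotation(before)
empty_stable = checksum_annotation("") == checksum_annotation("")
```

Because the annotation sits in the pod template, a changed digest changes the Deployment's pod spec, and Kubernetes replaces the pods through its normal rollout.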
   
   ## Files Changed
   
   - `chart/templates/scheduler/scheduler-deployment.yaml`
   - `chart/templates/api-server/api-server-deployment.yaml`
   - `chart/templates/triggerer/triggerer-deployment.yaml`
   - `chart/templates/dag-processor/dag-processor-deployment.yaml`
   
   ## Notes
   
   - This change is consistent with the existing annotation pattern used for 
`metadata-secret`, `result-backend-secret`, and `jwt-secret`.
   - The `worker-deployment.yaml` (CeleryExecutor workers) already has similar 
checksum annotations; KubernetesExecutor worker pods are ephemeral and always 
mount the current secret, so they do not need this annotation.
   - Users who provide their own Fernet key secret via `fernetKeySecretName` 
are unaffected — the template renders as empty when that value is set, so the 
checksum remains stable.

