Re: [PR] Chart: Reconcile metadata DB bidirectionally on helm upgrade [airflow]

via GitHub Wed, 10 Jun 2026 03:14:31 -0700


jykae commented on code in PR #68074:
URL: https://github.com/apache/airflow/pull/68074#discussion_r3387388192



##########
chart/newsfragments/68074.significant.rst:
##########
@@ -0,0 +1,42 @@
+Helm chart now reconciles the Airflow metadata DB bidirectionally on ``helm upgrade``
+
+The ``migrate-database-job`` is now run as a ``post-install,pre-upgrade`` hook
+(was ``post-install,post-upgrade``) and its default args run a reconciliation
+script that picks one of four actions depending on the relationship between
+the chart's ``airflowVersion`` and the alembic head currently in the database:
+
+* ``fresh install`` (no alembic row) — ``airflow db migrate``.
+* ``target == current`` — no-op.
+* ``target > current`` — ``airflow db migrate`` (existing behaviour).
+* ``target < current`` — ``airflow db downgrade --to-version <target>``
+  executed inside the still-running api-server pod (so reverse alembic
+  scripts that ship only in the source-version image are available),
+  followed by scaling every DB-touching workload of this release
+  (``api-server``, ``scheduler``, ``triggerer``, ``dag-processor``,
+  ``worker``) to ``0``. Helm then patches those workloads back to their
+  configured ``replicas`` with the TARGET image as the upgrade proceeds,
+  so the cluster comes back up cleanly on the target version without any
+  OLD code running against the now-downgraded schema. A downgrade is
+  therefore a brief outage rather than a zero-downtime rolling update —
+  this is unavoidable when the schema goes backwards.
+
+The action above is only decided once the metadata DB is reachable. If the DB
+is still unreachable after the connection-wait window
+(``DB_CONNECT_MAX_WAIT_SECONDS``, default ``120``), the job exits non-zero and
+is retried via the Job ``backoffLimit`` rather than proceeding with a migrate.
+
+A ``Role`` + ``RoleBinding`` granting ``pods`` (``get``, ``list``),
+``pods/exec`` (``get``), ``deployments`` / ``statefulsets`` (``get``, ``list``)
+and ``deployments/scale`` / ``statefulsets/scale`` (``patch``) to the
+existing ``migrate-database-job`` ``ServiceAccount`` are now rendered when

Review Comment:
   Good catch. The newsfragment now lists `pods/exec` (`get`, `create`) to 
match the Role, which grants both verbs so that exec works both through the 
GET-based call the reconciler uses (`connect_get_namespaced_pod_exec`) and 
through a classic POST-based exec.
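
A chart test could pin this down so the newsfragment and the Role do not drift apart again. The following is only a sketch: `verbs_for` and the `role_rules` literal are hypothetical stand-ins for whatever the rendered `Role` actually contains, with the resources and verbs copied from the newsfragment text.

```python
from __future__ import annotations


def verbs_for(rules: list[dict], resource: str) -> set[str]:
    """Collect every verb granted on *resource* across a Role's rules."""
    verbs: set[str] = set()
    for rule in rules:
        if resource in rule.get("resources", []):
            verbs.update(rule.get("verbs", []))
    return verbs


# Hypothetical rendered rules, mirroring the list in the newsfragment.
role_rules = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]},
    {"apiGroups": [""], "resources": ["pods/exec"], "verbs": ["get", "create"]},
    {"apiGroups": ["apps"], "resources": ["deployments", "statefulsets"], "verbs": ["get", "list"]},
    {"apiGroups": ["apps"], "resources": ["deployments/scale", "statefulsets/scale"], "verbs": ["patch"]},
]

# Both verbs are needed: GET-based exec checks ``get``, POST-based exec checks ``create``.
missing = {"get", "create"} - verbs_for(role_rules, "pods/exec")
print(sorted(missing))  # prints []
```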



##########
chart/files/db_migrate.py:
##########
@@ -0,0 +1,362 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Bidirectional Airflow metadata DB reconciliation for the helm chart.
+
+Decides at runtime whether the helm release wants a forward migrate, a
+downgrade, or a no-op, and runs the right command:
+
+* target == current  -> no-op (idempotent check)
+* target  > current  -> ``airflow db migrate`` inside this job's container
+  (uses the TARGET image, which ships forward scripts).
+* target  < current  -> ``airflow db downgrade --to-version <target>``
+  executed inside the still-running api-server pod (the OLD image still
+  ships the reverse scripts), followed by scaling every DB-touching
+  workload (api-server, scheduler, triggerer, dag-processor, worker) to
+  zero so that no OLD pod keeps talking to the now-downgraded schema. Helm
+  then patches those workloads back to ``replicas: N`` with the TARGET
+  image as the upgrade proceeds, so the cluster comes back up cleanly on
+  the target version. This means a downgrade trades the otherwise-broken
+  rolling-update window for a brief outage (which is unavoidable when the
+  schema goes backwards).
+
+Required env:
+
+* ``AIRFLOW_TARGET_VERSION`` - the version the chart is being upgraded/installed to.
+* ``POD_NAMESPACE`` - release namespace, injected via downward API.
+* ``RELEASE_NAME`` - the helm release name, used to scope the scale-down to
+  only the workloads owned by this release.
+
+Optional env:
+
+* ``MIGRATE_JOB_DRAIN_TIMEOUT_SECONDS`` - how long to wait for DB-touching
+  pods to terminate after scale-to-zero on a downgrade. Defaults to 300.
+  Increase when long-running worker tasks need more time to wind down.
+"""
+
+from __future__ import annotations
+
+import os
+import subprocess
+import sys
+import time
+
+from alembic.migration import MigrationContext
+from packaging.version import InvalidVersion, Version
+from sqlalchemy import text
+from sqlalchemy.exc import OperationalError
+from tenacity import (
+    retry,
+    retry_if_exception_type,
+    stop_after_attempt,
+    stop_after_delay,
+    wait_exponential,
+    wait_fixed,
+)
+
+from airflow.settings import engine
+
+# The kubernetes client is only used by the downgrade branch (exec ``airflow db
+# downgrade`` in the api-server pod, then scale DB-touching workloads to zero).
+# The base Airflow image does not ship the kubernetes client unless the
+# cncf.kubernetes provider is installed, so importing it unconditionally would
+# break the far more common forward / no-op / fresh paths. Import it lazily and
+# only fail when a downgrade is actually selected.
+try:
+    from kubernetes import client, config as k8s_config
+    from kubernetes.stream import stream
+
+    KUBERNETES_IMPORT_ERROR: ImportError | None = None
+except ImportError as exc:  # pragma: no cover - only on images without the k8s client
+    client = k8s_config = stream = None  # type: ignore[assignment]
+    KUBERNETES_IMPORT_ERROR = exc
+
+# ``_REVISION_HEADS_MAP`` is private today; a public accessor is being tracked
+# upstream so the chart can drop the leading underscore in a future release.
+from airflow.utils.db import _REVISION_HEADS_MAP
+
+
+def _resolve_target_rev(target: str) -> str | None:
+    """Return the alembic head for *target*, falling back to the nearest lower mapped version.
+
+    ``_REVISION_HEADS_MAP`` does not list every patch release — patches often
+    share the head of their minor's first release. Mirror Airflow's own CLI
+    behaviour by picking the highest mapped version that is ``<= target``.
+    """
+    if target in _REVISION_HEADS_MAP:
+        return _REVISION_HEADS_MAP[target]
+    try:
+        target_v = Version(target)
+    except InvalidVersion:
+        # ``AIRFLOW_TARGET_VERSION`` comes from chart config; a nonstandard or
+        # misconfigured ``airflowVersion`` (e.g. a dev build) is not PEP 440
+        # parseable. Treat it as an unknown target so :func:`decide_action`
+        # falls back to a conservative forward migrate instead of crashing.
+        return None
+    candidates = [(Version(v), rev) for v, rev in _REVISION_HEADS_MAP.items() if Version(v) <= target_v]
+    if not candidates:
+        return None
+    return max(candidates, key=lambda pair: pair[0])[1]
+
+
+def _db_connect_stop(retry_state):
+    # Evaluate ``DB_CONNECT_MAX_WAIT_SECONDS`` at retry time (not import time)
+    # so that operators -- and unit tests -- can tune the wait window without
+    # reloading the module. ``/entrypoint`` skips its DB-wait for non-airflow
+    # commands (we run as ``python -c ...``), so on a fresh install with a
+    # bundled postgres still starting, the first connect attempt races the DB.
+    # Default 120s matches the entrypoint's
+    # ``CONNECTION_CHECK_MAX_COUNT`` * ``CONNECTION_CHECK_SLEEP_TIME``.
+    delay = int(os.environ.get("DB_CONNECT_MAX_WAIT_SECONDS", "120"))
+    return stop_after_delay(delay)(retry_state)

Review Comment:
   Fixed — both timeout env vars now go through a shared `_int_env` helper that 
raises `SystemExit` with a clear "must be an integer number of seconds" message 
instead of letting an opaque `ValueError` bubble up from deep in the job.



##########
chart/files/db_migrate.py:
##########
@@ -0,0 +1,362 @@
+
+
+@retry(
+    stop=_db_connect_stop,
+    wait=wait_fixed(3),
+    retry=retry_if_exception_type(OperationalError),
+    reraise=True,
+)
+def _wait_for_db_ready() -> None:
+    """Block until the metadata DB accepts a connection.
+
+    Called once at the top of :func:`main` so that *every* downstream step
+    (``decide_action``, ``airflow db migrate`` subprocess, ``kubernetes``
+    api-server pod exec) can assume the DB is reachable. Without this, the
+    subprocess branch had no DB-wait of its own and would fail immediately
+    on a fresh install where the bundled postgres was still initialising,
+    causing ``BackoffLimitExceeded`` on the Job.
+    """
+    with engine.connect() as conn:
+        conn.execute(text("SELECT 1"))
+
+
+def decide_action(target: str) -> str:
+    """Return one of ``noop``, ``forward``, ``downgrade``, ``fresh``.
+
+    Assumes the DB is already reachable -- :func:`_wait_for_db_ready` must
+    have been called first.
+    """
+    target_rev = _resolve_target_rev(target)
+    if target_rev is None:
+        # Unknown target version (e.g. dev build). Be conservative: forward only.
+        return "forward"
+
+    with engine.connect() as conn:
+        current_rev = MigrationContext.configure(conn).get_current_revision()
+
+    if current_rev is None:
+        return "fresh"
+    if current_rev == target_rev:
+        return "noop"
+
+    # Reverse-lookup current_rev to determine which version it belongs to,
+    # then compare versions. If current_rev isn't mapped (dev/intermediate
+    # alembic revision) be conservative and forward-migrate rather than risk
+    # an incorrect downgrade.
+    rev_to_version = {rev: ver for ver, rev in _REVISION_HEADS_MAP.items()}
+    current_version = rev_to_version.get(current_rev)
+    if current_version is None:
+        return "forward"
+    if Version(current_version) > Version(target):
+        return "downgrade"
+    return "forward"
+
+
+@retry(
+    stop=stop_after_attempt(5),
+    wait=wait_exponential(multiplier=2, min=2, max=30),
+    retry=retry_if_exception_type(RuntimeError),
+    reraise=True,
+)
+def discover_api_server_pod(namespace: str) -> str:
+    """Return the name of a Running api-server pod in *namespace*.
+
+    Retries on ``RuntimeError`` so a rolling restart of the api-server
+    deployment (no Running pod for a few seconds) does not fail the job.
+    """
+    k8s_config.load_incluster_config()
+    api = client.CoreV1Api()
+    pods = api.list_namespaced_pod(
+        namespace=namespace,
+        label_selector="component=api-server",
+        field_selector="status.phase=Running",
+    ).items
+    if not pods:
+        raise RuntimeError(f"no Running api-server pod found in namespace {namespace}")
+
+    # Prefer Ready pods so we don't pick one mid-rollout.
+    ready = [
+        p for p in pods if any(c.type == "Ready" and c.status == "True" for c in (p.status.conditions or []))
+    ]
+    return (ready or pods)[0].metadata.name
+
+
+def run_downgrade_in_api_server(pod_name: str, namespace: str, target_version: str) -> int:
+    """Exec ``airflow db downgrade`` in the api-server pod via the Kubernetes API."""
+    k8s_config.load_incluster_config()
+    api = client.CoreV1Api()
+    command = ["airflow", "db", "downgrade", "--to-version", target_version, "--yes"]
+
+    resp = stream(
+        api.connect_get_namespaced_pod_exec,
+        pod_name,
+        namespace,
+        container="api-server",
+        command=command,
+        stderr=True,
+        stdin=False,
+        stdout=True,
+        tty=False,
+        _preload_content=False,
+    )
+    while resp.is_open():
+        resp.update(timeout=1)
+        if resp.peek_stdout():
+            sys.stdout.write(resp.read_stdout())
+            sys.stdout.flush()
+        if resp.peek_stderr():
+            sys.stderr.write(resp.read_stderr())
+            sys.stderr.flush()
+    returncode = resp.returncode
+    resp.close()
+    # Treat an unknown exit code as failure rather than success: if the stream
+    # closes without populating returncode we cannot prove the downgrade ran.
+    if returncode is None:
+        sys.stderr.write("[db_migrate] downgrade exec stream closed without an exit code\n")
+        return 1
+    return returncode
+
+
+# Components that talk to the metadata DB and are managed by this helm release.
+# A downgrade-then-scale-back sequence must drain all of them so no OLD code
+# keeps talking to the now-downgraded schema before helm rolls in TARGET pods.
+_DB_TOUCHING_COMPONENTS = (
+    "api-server",
+    "scheduler",
+    "triggerer",
+    "dag-processor",
+    "worker",
+)
+
+
+def scale_release_workloads_to_zero(
+    namespace: str, release_name: str, timeout_seconds: int | None = None
+) -> None:
+    """Scale all DB-touching workloads of this release to 0 and wait for drain.
+
+    Helm will patch the same Deployments/StatefulSets back to ``replicas: N``
+    when it applies the post-hook manifests, so we deliberately do NOT scale
+    them up again here.
+
+    The drain deadline defaults to ``MIGRATE_JOB_DRAIN_TIMEOUT_SECONDS`` (or
+    300s if unset). Operators with long-running worker tasks can raise this
+    via the ``migrateDatabaseJob.drainTimeoutSeconds`` chart value.
+    """
+    if timeout_seconds is None:
+        timeout_seconds = int(os.environ.get("MIGRATE_JOB_DRAIN_TIMEOUT_SECONDS", "300"))

Review Comment:
   Fixed — `MIGRATE_JOB_DRAIN_TIMEOUT_SECONDS` is now parsed via the same 
`_int_env` helper that exits with an actionable message on a non-integer value.
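
The precedence this implies for the drain deadline can be sketched as follows: an explicit `timeout_seconds` argument wins, then the environment variable, then the 300s default. `resolve_drain_timeout` is a hypothetical name used only for illustration, and the parsing is inlined here so the snippet stands alone; the PR routes it through `_int_env` instead.

```python
from __future__ import annotations

import os

DEFAULT_DRAIN_TIMEOUT_SECONDS = 300  # default stated in the module docstring


def resolve_drain_timeout(timeout_seconds: int | None = None) -> int:
    """Return the drain deadline: explicit argument, then env var, then default.

    Hypothetical helper; it only illustrates the lookup order.
    """
    if timeout_seconds is not None:
        return timeout_seconds
    name = "MIGRATE_JOB_DRAIN_TIMEOUT_SECONDS"
    raw = os.environ.get(name)
    if raw is None:
        return DEFAULT_DRAIN_TIMEOUT_SECONDS
    try:
        return int(raw)
    except ValueError:
        # Exit with an actionable message rather than an opaque ValueError traceback.
        raise SystemExit(f"{name} must be an integer number of seconds, got {raw!r}") from None
```

Keeping the explicit argument first means unit tests can pass a short deadline without touching the process environment.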



##########
chart/tests/helm_tests/airflow_aux/test_db_migrate_script.py:
##########
@@ -0,0 +1,516 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Unit tests for the embedded bidirectional reconciler ``chart/files/db_migrate.py``.
+
+These exercise the runtime behaviour of the script itself (the helm template
+tests only check that the script is embedded with the right env/args/RBAC).
+"""
+
+from __future__ import annotations
+
+import importlib.util
+import pathlib
+import types
+from unittest import mock
+
+import pytest
+from kubernetes import client as k8s_client
+from sqlalchemy.exc import OperationalError

Review Comment:
   Fixed — the kubernetes import in the test module is now guarded (falls back 
to `None`), and the downgrade-branch tests that need `spec=k8s_client...` are 
skipped via `@pytest.mark.skipif` so the suite stays importable and runnable on 
images without the kubernetes client, mirroring the reconciler's lazy import.
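
The pattern described above could look something like this. It is a sketch rather than the test module itself: the `requires_k8s` marker name and the test body are hypothetical, and only the guarded import, the `None` fallback, and the use of `@pytest.mark.skipif` with `spec=k8s_client...` are taken from the comment.

```python
from __future__ import annotations

from unittest import mock

import pytest

try:
    from kubernetes import client as k8s_client
except ImportError:  # images without the kubernetes client
    k8s_client = None

# Hypothetical shared marker: skip downgrade-branch tests when the client is missing.
requires_k8s = pytest.mark.skipif(k8s_client is None, reason="kubernetes client is not installed")


@requires_k8s
def test_core_api_mock_is_spec_bound():
    # ``spec=`` needs the real class, which is why this test cannot run without the client.
    api = mock.MagicMock(spec=k8s_client.CoreV1Api)
    api.list_namespaced_pod.return_value.items = []
    assert api.list_namespaced_pod(namespace="airflow").items == []
    # A spec-bound mock rejects attributes that CoreV1Api does not define.
    with pytest.raises(AttributeError):
        api.no_such_method
```

Because the marker's condition is evaluated at import time against the `None` fallback, collection succeeds on either kind of image and only the spec-dependent tests are skipped.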



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

