romanzdk opened a new issue, #68317:
URL: https://github.com/apache/airflow/issues/68317

   ### Under which category would you file this issue?
   
   Airflow Core
   
   ### Apache Airflow version
   
   3.2.1
   
   ### What happened and how to reproduce it?
   
   We upgraded production from **Airflow 3.1.8** to **3.2.1**, hit scheduling 
issues, then attempted to roll back by:
   
   1. Deploying Airflow **3.1.8** image
   2. Running `airflow db downgrade -n 3.1.8`
   
   Alembic migrations completed successfully. However, the **scheduler and triggerer crashed immediately** on startup with:
   
   ```
   KeyError: <Encoding.VAR: '__var'>
   ```
   
   in `BaseSerialization.deserialize`, while loading rows from the metadata 
database (e.g. during `_schedule_all_dag_runs` or trigger deserialization).
   
   ### Environment
   
   - **Upgrade path:** 3.1.8 → 3.2.1 → attempted rollback to 3.1.8
   - **Database:** PostgreSQL
   - **Executor:** KubernetesExecutor
   
   Example traceback (scheduler):
   
   ```
   File ".../scheduler_job_runner.py", line 1968, in _schedule_all_dag_runs
       callback_tuples = [(run, self._schedule_dag_run(run, session=session)) for run in dag_runs]
   ...
   File ".../airflow/utils/sqlalchemy.py", line 137, in process_result_value
       return BaseSerialization.deserialize(value)
   File ".../airflow/serialization/serialized_objects.py", line 897, in deserialize
       var = encoded_var[Encoding.VAR]
   KeyError: <Encoding.VAR: '__var'>
   ```
   
   Similar failures occur in the triggerer when deserializing 
`trigger.encrypted_kwargs`.
   
   ### Root cause (our analysis)
   
   There are **two separate layers** in the metadata DB:
   
   | Layer | What downgrade migrations handle | What they do NOT handle |
   |-------|-----------------------------------|-------------------------|
   | Schema | Tables, columns, alembic revision | — |
   | Row content | — | Serialized JSON blobs written while 3.2 was running |
   
   While Airflow 3.2.x was running, it wrote metadata using **SDK serde** (see 
[#59711](https://github.com/apache/airflow/pull/59711) and 3.2.1 release notes 
— serde moved to `airflow.sdk.serde`). Examples:
   
   - `trigger.encrypted_kwargs`
   - `dag_run.conf` and related serialized columns
   - Deferred task / trigger payloads
   
   Airflow **3.1.8** reads these via legacy `BaseSerialization.deserialize()`, 
which expects the `{__type, __var}` wrapper format. SDK-serde blobs do not have 
`__var` at the top level → `KeyError`.
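The failure mode can be reproduced in isolation. The sketch below is a simplified stand-in for the legacy lookup, not Airflow's actual code: `legacy_deserialize` is a hypothetical name, the `Encoding` enum only mirrors the key names mentioned in this report, and the `__classname__`/`__version__`/`__data__` keys in the second payload are an assumption about what a blob without the legacy wrapper might look like.

```python
from enum import Enum


class Encoding(str, Enum):
    """Simplified mirror of the key names mentioned in this report."""

    TYPE = "__type"
    VAR = "__var"


def legacy_deserialize(encoded_var):
    """Hypothetical stand-in for the legacy wrapper lookup."""
    if not isinstance(encoded_var, dict):
        return encoded_var  # primitives pass through unchanged
    # Raises KeyError when the blob has no top-level "__var" key.
    return encoded_var[Encoding.VAR]


# Blob in the legacy {__type, __var} wrapper format.
legacy_blob = {"__type": "dict", "__var": {"poll_interval": 5}}

# Assumed shape of a blob written without the legacy wrapper.
other_blob = {
    "__classname__": "builtins.dict",
    "__version__": 1,
    "__data__": {"poll_interval": 5},
}

print(legacy_deserialize(legacy_blob))
try:
    legacy_deserialize(other_blob)
except KeyError as exc:
    print(f"KeyError: {exc}")
```

Because the lookup is unconditional, any dict-shaped payload without the wrapper fails the same way, regardless of which column it was loaded from.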
   
   `airflow db downgrade` reverts the **schema** to 3.1.8-compatible structure 
but does **not** rewrite existing row payloads back to 3.1 serialization format.
   
   ### What we tried
   
   - `airflow db downgrade -n 3.1.8` — succeeds, but runtime still crashes
   - Deploying the old 3.1.8 application image — correct for code, insufficient 
for DB content
   - Manual cleanup (risky): `DELETE FROM trigger;` plus marking stuck `dag_run` rows as failed. This helps only partially and is not a safe general solution.
   
   **Only clean rollback path:** restore PostgreSQL from a backup taken 
**before** the 3.2 upgrade.
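Since a pre-upgrade backup is the only clean way back, it may help to script one before any upgrade. The following is a minimal sketch, not a vetted backup procedure: `build_pg_dump_command` is a hypothetical helper, and the connection URI and output path are placeholders. It only builds the `pg_dump` command line and leaves running it to the operator.

```python
import shlex


def build_pg_dump_command(db_uri: str, out_path: str) -> list[str]:
    """Build a pg_dump invocation that writes a custom-format archive."""
    return [
        "pg_dump",
        "--format=custom",     # archive format restorable with pg_restore
        f"--file={out_path}",  # where the archive is written
        f"--dbname={db_uri}",  # pg_dump accepts a connection URI here
    ]


cmd = build_pg_dump_command(
    "postgresql://airflow@db.example.internal/airflow",  # placeholder URI
    "airflow-pre-3.2.dump",  # placeholder output path
)

# Print the command for review; run it with subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Stopping the scheduler and triggerer before taking the dump avoids capturing rows mid-write.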
   
   ### Expected behavior
   
   The upgrade/downgrade documentation should clearly state:
   
   1. **Downgrading Airflow major/minor versions is not fully supported** 
without a metadata DB backup/restore.
   2. **`airflow db downgrade` only reverts schema** (alembic migrations). It 
does not migrate serialized row content.
   3. After running 3.2.x against a database, rolling back to 3.1.x requires 
either:
      - Restoring a pre-3.2 DB backup, or
      - Manual cleanup of incompatible rows (triggers, active dag runs with 
3.2-format conf, etc.) — with data loss risk
   4. The 3.2 serde migration 
([#59711](https://github.com/apache/airflow/pull/59711)) affects trigger kwargs 
and related fields; this is not reversed on downgrade.
   
   Suggested doc locations:
   
   - Upgrade guide / release notes for 3.2.0 / 3.2.1
   - `docs/howto/upgrading.rst` or equivalent
   - `airflow db downgrade` CLI help text
   
   ### Actual behavior
   
   - Downgrade migrations report success
   - Users reasonably assume the DB is compatible with 3.1.8
   - Scheduler/triggerer crash-loop with an opaque `KeyError: <Encoding.VAR: '__var'>`
   - No guidance on which tables/rows are affected or how to recover
   
   ### Suggested diagnostic queries
   
   ```sql
   -- Triggers written under 3.2 SDK serde (may lack __var wrapper)
   SELECT id, classpath, LEFT(encrypted_kwargs::text, 120)
   FROM trigger
   LIMIT 20;
   
   -- Active dag runs that may carry 3.2-format conf
   SELECT dag_id, run_id, state, LEFT(conf::text, 120)
   FROM dag_run
   WHERE state IN ('running', 'queued')
     AND conf IS NOT NULL
     AND conf::text NOT LIKE '%__var%';
   ```
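The `LIKE '%__var%'` filter is a substring match, so it also matches payloads where `__var` appears only in a nested value. A stricter check can be done client-side on the fetched text. The sketch below assumes the column text is plain JSON; `classify_blob` is a hypothetical helper, and values that are stored encrypted would first need decrypting (they are reported as `unparseable` here).

```python
import json


def classify_blob(raw_text):
    """Classify a serialized column value by its top-level shape."""
    try:
        value = json.loads(raw_text)
    except (TypeError, ValueError):
        # None, ciphertext, or otherwise invalid JSON.
        return "unparseable"
    if not isinstance(value, dict):
        return "not-a-dict"
    # Only a top-level "__var" key counts as the legacy wrapper.
    return "legacy-wrapper" if "__var" in value else "no-wrapper"


# Hypothetical sample rows as (identifier, column text) pairs.
rows = [
    ("run_a", '{"__type": "dict", "__var": {"x": 1}}'),
    ("run_b", '{"payload": {"__var": 1}}'),
    ("run_c", "gAAAAABexample"),
]

for identifier, text in rows:
    print(identifier, classify_blob(text))
```

A `no-wrapper` result only marks a row as a candidate for inspection; it does not prove the row was written by 3.2.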
   
   ### Related issues / PRs
   
   - [#59711](https://github.com/apache/airflow/pull/59711) — SDK serde for 
trigger/next kwargs
   - [#64613](https://github.com/apache/airflow/issues/64613) — trigger 
deserialization errors with external-event DAGs
   - [#65973](https://github.com/apache/airflow/issues/65973) — asset trigger 
kwargs format change 3.1.8 → 3.2.1
   - [#65688](https://github.com/apache/airflow/pull/65688) — scheduler 
UniqueViolation on downgrade 3.2.0 → 3.1.x (schema-level fix, not serde data)
   - [#63434](https://github.com/apache/airflow/issues/63434), 
[#63444](https://github.com/apache/airflow/issues/63444), 
[#63535](https://github.com/apache/airflow/issues/63535) — other 3.2 → 3.1 
downgrade migration failures
   
   ### Why this matters
   
   Teams hitting issues on 3.2 may attempt a downgrade as their first recovery step. A downgrade that succeeds at the schema level but fails at runtime is worse than a clear "unsupported, restore from backup" message. We lost time debugging this as a dependency/version mismatch before identifying the serde data layer as the cause.
   
   ### What you think should happen instead?
   
   _No response_
   
   ### Operating System
   
   _No response_
   
   ### Deployment
   
   None
   
   ### Apache Airflow Provider(s)
   
   _No response_
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Official Helm Chart version
   
   Not Applicable
   
   ### Kubernetes Version
   
   _No response_
   
   ### Helm Chart configuration
   
   _No response_
   
   ### Docker Image customizations
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

