seanmuth opened a new issue, #68248:
URL: https://github.com/apache/airflow/issues/68248

   ## What happened
   
   The scheduler crashloops on deployments with historical `task_instance` 
records where `dag_version_id IS NULL`. These records exist on any deployment 
that was running before the `dag_version` table was introduced (migration 
`0047_3_0_0_add_dag_versioning`).
   
   The scheduler fails when it attempts to construct a `DagRunContext` using 
one of these historical TIs as `last_ti`:
   
   ```
   pydantic_core._pydantic_core.ValidationError: 1 validation error for DagRunContext
   last_ti.dag_version_id
     UUID input should be a string, bytes or UUID object [type=uuid_type, input_value=None, input_type=NoneType]
       For further information visit https://errors.pydantic.dev/2.13/v/uuid_type
   ```
   
   ## Airflow Version
   
   3.1.x (Astro Runtime 3.1-15)
   
   ## Steps to Reproduce
   
   1. Have a deployment with historical TI records predating `dag_version` 
(i.e. `task_instance.dag_version_id IS NULL`)
   2. Upgrade to Airflow 3.1.x
   3. Scheduler begins processing a DAG run whose `last_ti` is one of these 
historical records
   4. Scheduler crashloops
   
   ## Expected Behavior
   
   The scheduler should not crash when encountering a historical TI with 
`dag_version_id=None`, nor should it silently skip or ignore the associated DAG 
run. A reasonable fallback would be to substitute the most recent 
`dag_version_id` for the given `dag_id` when constructing `DagRunContext` — 
keeping the run in-flight while avoiding the validation error. Open to other 
approaches from the community.
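
The fallback proposed above could look roughly like the following. This is a minimal sketch, not Airflow's code: `resolve_dag_version_id` and `latest_version_by_dag` are hypothetical names, and the mapping stands in for whatever lookup the scheduler would use to find the most recent `dag_version` row for a `dag_id`.

```python
from __future__ import annotations

import uuid
from typing import Mapping, Optional


def resolve_dag_version_id(
    ti_dag_version_id: Optional[uuid.UUID],
    dag_id: str,
    latest_version_by_dag: Mapping[str, uuid.UUID],  # hypothetical lookup
) -> uuid.UUID:
    """Return the TI's own dag_version_id, or the latest one known for dag_id."""
    if ti_dag_version_id is not None:
        # Normal case: the TI already references a dag_version row.
        return ti_dag_version_id
    try:
        # Historical TI: substitute the most recent version for its DAG.
        return latest_version_by_dag[dag_id]
    except KeyError:
        # No dag_version row exists at all; surface that explicitly
        # instead of passing None on to validation.
        raise LookupError(f"no dag_version found for dag_id={dag_id!r}") from None
```

With this shape, a `None` value never reaches the `DagRunContext` validation; a DAG with no `dag_version` rows at all still fails, but with an error that names the `dag_id`.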
   
   ## Actual Behavior
   
   Scheduler crashloops continuously. The only workaround is to backfill all 
historical TIs with a valid `dag_version_id`:
   
   ```sql
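   -- PostgreSQL syntax (DISTINCT ON). Run in batches due to volume (can be
   -- 100M+ rows on long-running deployments); as written, this single
   -- statement updates every matching row at once.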
   WITH latest_version AS (
       SELECT DISTINCT ON (dag_id) id, dag_id
       FROM dag_version
       ORDER BY dag_id, version_number DESC
   )
   UPDATE task_instance ti
   SET dag_version_id = lv.id
   FROM latest_version lv
   WHERE ti.dag_id = lv.dag_id
     AND ti.dag_version_id IS NULL;
   ```
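
One way to run the same backfill in batches is sketched below. It is an illustration under assumptions, not a tested migration: it assumes `task_instance` has an `id` primary key, the index name `idx_ti_null_dag_version` is made up, and `sqlite3` is imported only so the sketch can be tried locally against a throwaway database. On PostgreSQL a DB-API connection (for example from psycopg) would be passed in instead. The `DISTINCT ON` CTE is replaced by a correlated subquery so the statement is portable.

```python
import sqlite3  # only needed to try the sketch locally; any DB-API connection works

# Hypothetical index name; the predicate follows the partial index
# suggested under "Additional Context".
CREATE_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS idx_ti_null_dag_version
ON task_instance (dag_id)
WHERE dag_version_id IS NULL
"""

# One batch: take up to {limit} NULL rows whose DAG has at least one
# dag_version row and point them at that DAG's highest version_number.
BATCH_UPDATE_SQL = """
UPDATE task_instance
SET dag_version_id = (
    SELECT dv.id
    FROM dag_version dv
    WHERE dv.dag_id = task_instance.dag_id
    ORDER BY dv.version_number DESC
    LIMIT 1
)
WHERE id IN (
    SELECT ti.id
    FROM task_instance ti
    WHERE ti.dag_version_id IS NULL
      AND EXISTS (SELECT 1 FROM dag_version dv WHERE dv.dag_id = ti.dag_id)
    LIMIT {limit}
)
"""


def backfill_in_batches(conn, batch_size=10_000):
    """Backfill NULL dag_version_id values batch by batch; return rows updated."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    sql = BATCH_UPDATE_SQL.format(limit=int(batch_size))  # int() keeps the SQL safe
    cur = conn.cursor()
    cur.execute(CREATE_INDEX_SQL)
    conn.commit()
    total = 0
    while True:
        cur.execute(sql)
        updated = cur.rowcount
        conn.commit()  # commit per batch so each transaction stays small
        if updated <= 0:
            return total
        total += updated
```

TIs whose `dag_id` has no `dag_version` row are deliberately left untouched; without the `EXISTS` filter they would be reselected forever and the loop would never finish.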
   
   ## Additional Context
   
   - `dag_version_id` FK constraint was changed from `ON DELETE CASCADE` to `ON 
DELETE RESTRICT` in migration `0072_3_1_0` — tightening the relationship 
between TIs and dag_version rows makes this null scenario more impactful
   - On large deployments this backfill can affect 100M+ rows; creating a partial 
index on `task_instance (dag_id) WHERE dag_version_id IS NULL` before running 
it is recommended
   - Related: #66177 (FK deadlock on `db clean` with dag_version)

