anmolxlight opened a new pull request, #68635:
URL: https://github.com/apache/airflow/pull/68635

   ## Summary
   
   Store a hash of `dag_version.version_data` to avoid loading and comparing 
the full JSON manifest on every DAG parse.
   
   ### Problem
   
   `SerializedDagModel.write_dag`'s "serialized hash unchanged" fast path 
refreshes `DagVersion.bundle_version` / `version_data` in place, comparing the 
full stored `version_data` against the incoming value. This has two costs:
   
   1. `_prefetch_dag_write_metadata` loads the **full** `DagVersion` row — 
including the entire `version_data` JSON — for every DAG in the bulk write.
   2. The steady-state same-bundle case re-compares the full `version_data` 
dict on every parse, even though nothing has changed.
   
   ### Solution
   
   Persist a `version_data_hash` (md5 of canonical JSON, `String(32)`, 
nullable) on `dag_version` and compare/prefetch that instead of the full blob:
   
   - **`DagVersion` model**: new `version_data_hash` column + 
`compute_version_data_hash()` static method
   - **`_prefetch_dag_write_metadata`**: uses `load_only()` to skip loading the 
`version_data` JSON column entirely
   - **Fast path comparison**: compares `version_data_hash` instead of full 
dicts
   - **In-place refresh**: updates `version_data_hash` when bundle metadata 
changes
   - **New `DagVersion` rows**: `version_data_hash` is computed on creation
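
A minimal sketch of the hashing and comparison described above, not the code from this PR. It assumes "canonical JSON" means `json.dumps` with sorted keys and compact separators, and that a missing `version_data` maps to a NULL hash. `bundle_metadata_changed` is a hypothetical helper name used only for illustration.

```python
import hashlib
import json
from typing import Any, Optional


def compute_version_data_hash(version_data: Optional[dict[str, Any]]) -> Optional[str]:
    """Return the md5 hex digest (32 chars) of canonical JSON, or None if there is no data."""
    if version_data is None:
        # Assumption: no version_data is stored as a NULL hash (the column is nullable).
        return None
    # Assumption: sorted keys + compact separators make key order irrelevant to the hash.
    canonical = json.dumps(version_data, sort_keys=True, separators=(",", ":"))
    # md5 is used only as a change detector here, not for security.
    return hashlib.md5(canonical.encode("utf-8"), usedforsecurity=False).hexdigest()


def bundle_metadata_changed(stored_hash: Optional[str], incoming: Optional[dict[str, Any]]) -> bool:
    """Hypothetical fast-path check: compare the stored hash to the incoming data's hash."""
    return stored_hash != compute_version_data_hash(incoming)


stored = compute_version_data_hash({"commit": "abc123", "paths": ["dags/a.py"]})
# Same content with a different key order hashes identically, so no refresh is needed.
print(bundle_metadata_changed(stored, {"paths": ["dags/a.py"], "commit": "abc123"}))
```

Under these assumptions, a row whose hash is still NULL (for example, one written before the column existed) compares as changed against any non-empty incoming `version_data`, so it would be refreshed once and carry a hash afterwards.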
   
   ### Verification
   
   - All 66 `test_serialized_dag` tests pass
   - All 8 `test_dag_version` tests pass
   - All migrations chain correctly from the latest revision, `9ff64e1c35d3`
   
   Closes: #68567
   

