kalluripradeep opened a new pull request, #64503:
URL: https://github.com/apache/airflow/pull/64503

   When LocalExecutor runs with high parallelism, a race condition can occur:
   a task instance is completed/deleted between the time
   `_check_for_removed_or_restored_tasks` loads TIs into the session and
   the time `session.flush()` is called inside `_create_task_instances`.
   
   This raises a `StaleDataError` (SQLAlchemy ORM's error for an UPDATE or
   DELETE that matched a different number of rows than expected), which
   was previously uncaught and crashed the scheduler instead of letting
   it recover.
   
   The key reason it slipped through: `StaleDataError` is **not** a
   subclass of `DBAPIError`, so it bypassed both the
   `except IntegrityError` guard in `_create_task_instances` **and** the
   tenacity retry wrapper in `run_with_db_retries`.
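
To illustrate why the error escaped, here is a minimal sketch using stand-in exception classes that mirror the relevant part of SQLAlchemy's hierarchy (`IntegrityError` under `DBAPIError`, `StaleDataError` on a sibling branch under `SQLAlchemyError`). The classes and the helpers `flush_with_guard` and `outcome` are hypothetical and are not Airflow or SQLAlchemy code.

```python
# Stand-ins that mirror the shape of SQLAlchemy's exception hierarchy.
class SQLAlchemyError(Exception): ...
class DBAPIError(SQLAlchemyError): ...
class IntegrityError(DBAPIError): ...
class StaleDataError(SQLAlchemyError): ...  # sibling branch, not under DBAPIError


def flush_with_guard(flush):
    """Run flush() behind a guard that only knows about IntegrityError."""
    try:
        flush()
    except IntegrityError:
        return "handled"
    return "ok"


def raise_integrity():
    raise IntegrityError("duplicate key")


def raise_stale():
    raise StaleDataError("DELETE expected 1 row, matched 0")


def outcome(flush):
    """Report whether the error was handled by the guard or escaped it."""
    try:
        return flush_with_guard(flush)
    except StaleDataError:
        return "escaped"
```

With this hierarchy, `outcome(raise_integrity)` is handled by the guard, while `outcome(raise_stale)` propagates past it, which matches the failure mode described above.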
   
   **Changes:**
   - Catch `StaleDataError` alongside `IntegrityError` in
     `_create_task_instances` and roll back the session safely
   - Add `StaleDataError` to the tenacity retry list in
     `run_with_db_retries` so the scheduling loop retries the transient
     race condition
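
The two changes can be sketched roughly as follows. This is a simplified, standard-library-only illustration, not the actual patch: the exception classes are stand-ins, `FakeSession` is a hypothetical test double, and `retry_db_call` is a hypothetical plain loop standing in for the tenacity-based `run_with_db_retries`.

```python
# Stand-in exception classes (assumptions, not the real SQLAlchemy classes).
class DBAPIError(Exception): ...
class IntegrityError(DBAPIError): ...
class StaleDataError(Exception): ...


class FakeSession:
    """Hypothetical session whose flush() raises the queued errors in order."""

    def __init__(self, flush_errors=()):
        self._flush_errors = list(flush_errors)
        self.rollbacks = 0

    def flush(self):
        if self._flush_errors:
            raise self._flush_errors.pop(0)

    def rollback(self):
        self.rollbacks += 1


def create_task_instances(session):
    """Change 1: treat a stale row like an integrity conflict and roll back."""
    try:
        session.flush()
        return True
    except (IntegrityError, StaleDataError):
        session.rollback()
        return False


# Change 2: the retry list now includes StaleDataError.
RETRYABLE = (DBAPIError, StaleDataError)


def retry_db_call(func, attempts=3):
    """Call func(), retrying on RETRYABLE errors; re-raise on the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except RETRYABLE:
            if attempt == attempts:
                raise
```

In this sketch, a session that raises `StaleDataError` once is rolled back and the next flush succeeds, and a callable that fails twice with `StaleDataError` succeeds on the third attempt of `retry_db_call`.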
   
   **Tests added:**
   - `test_verify_integrity_handles_stale_data_error` — verifies
     `StaleDataError` during `session.flush()` is caught and
     `session.rollback()` is called
   - `test_retry_db_transaction_with_stale_data_error` — verifies
     `StaleDataError` is retried 3 times by `run_with_db_retries`
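
The shape of such tests might look like the sketch below, which uses `unittest.mock` to force the error. The functions under test (`flush_or_rollback`, `call_with_retries`) and the `StaleDataError` stand-in are hypothetical, and "retried 3 times" is assumed here to mean three attempts in total; the real tests exercise Airflow's own code paths.

```python
from unittest import mock


class StaleDataError(Exception):
    """Stand-in for the SQLAlchemy ORM exception (assumption)."""


def flush_or_rollback(session):
    # Hypothetical function under test: roll back when flush hits a stale row.
    try:
        session.flush()
    except StaleDataError:
        session.rollback()


def call_with_retries(func, attempts=3):
    # Hypothetical retry helper: re-raise only after the final attempt fails.
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except StaleDataError:
            if attempt == attempts:
                raise


def test_flush_error_triggers_rollback():
    session = mock.MagicMock()
    session.flush.side_effect = StaleDataError("row vanished")
    flush_or_rollback(session)
    session.rollback.assert_called_once()


def test_stale_data_error_is_retried_three_times():
    func = mock.Mock(side_effect=StaleDataError("row vanished"))
    try:
        call_with_retries(func)
    except StaleDataError:
        pass  # expected once all attempts are exhausted
    assert func.call_count == 3
```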
   
   Fixes #63926
   

