kalluripradeep commented on issue #63926:
URL: https://github.com/apache/airflow/issues/63926#issuecomment-4157513914

   Dug into this and opened a fix in #64503.
   
   The crash comes down to a race condition. When `verify_integrity` runs, it 
loads all task instances into the SQLAlchemy session and marks some of them 
dirty (removing or restoring state). If the executor happens to complete one of 
those tasks in between — before `session.flush()` is called — SQLAlchemy sees 
that the row it expected to update is gone or changed, and throws a 
`StaleDataError`.
   
   The reason this crashes the scheduler instead of being handled gracefully is 
that `StaleDataError` is an ORM-level exception, not a subclass of 
`DBAPIError`. So it bypassed both the `except IntegrityError` catch in 
`_create_task_instances` and the tenacity retry wrapper in 
`run_with_db_retries` — neither of them knew to handle it.
   
   The fix catches `StaleDataError` alongside `IntegrityError` in 
`_create_task_instances` (rolling back the session safely), and also adds it to 
the tenacity retry list so the scheduling loop retries instead of dying.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to