kalluripradeep commented on issue #63926: URL: https://github.com/apache/airflow/issues/63926#issuecomment-4157513914
Dug into this and opened a fix in #64503. The crash comes down to a race condition. When `verify_integrity` runs, it loads all task instances into the SQLAlchemy session and marks some of them dirty (removing or restoring state). If the executor happens to complete one of those tasks in between — before `session.flush()` is called — SQLAlchemy sees that the row it expected to update is gone or changed, and throws a `StaleDataError`. The reason this crashes the scheduler instead of being handled gracefully is that `StaleDataError` is an ORM-level exception, not a subclass of `DBAPIError`. So it bypassed both the `except IntegrityError` catch in `_create_task_instances` and the tenacity retry wrapper in `run_with_db_retries` — neither of them knew to handle it. The fix catches `StaleDataError` alongside `IntegrityError` in `_create_task_instances` (rolling back the session safely), and also adds it to the tenacity retry list so the scheduling loop retries instead of dying. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
