safaehar opened a new pull request, #68227: URL: https://github.com/apache/airflow/pull/68227
## Motivation

`TriggerRunnerSupervisor.clean_unused()` calls `Trigger.clean_unused()`, which executes a `DELETE FROM trigger WHERE ...` that joins against `task_instance`. Under row-level lock contention, specifically when the triggerer's own `SELECT ... FOR UPDATE` queries hold locks on trigger rows while async coroutine work is in progress, the DELETE blocks waiting for those locks. If the wait exceeds the database `statement_timeout`, PostgreSQL raises `QueryCanceled`, which SQLAlchemy surfaces as `OperationalError`. This exception propagates unhandled up through `TriggerRunnerSupervisor.run()`, crashing the triggerer process into CrashLoopBackOff.

This was observed in production (PostgreSQL metaDB, Airflow 3.2.1) across multiple workergroup deployments with 20–32 triggerer restarts over 3 days.

## Changes

- Wrap `Trigger.clean_unused()` in `TriggerRunnerSupervisor.clean_unused()` to catch `OperationalError` and log a warning instead of propagating the exception
- Add `sqlalchemy.exc` to the existing `sqlalchemy` import

## Why this is safe

`clean_unused()` is best-effort periodic housekeeping. Orphaned trigger rows sitting in the database for one extra heartbeat cycle (~1s) have no functional impact: triggers still fire, deferrable tasks still run. The cleanup retries on the next heartbeat. Crashing the triggerer over a transient DB error is strictly worse than skipping one cleanup cycle.

## Alternatives considered

- Fixing the lock contention directly: the triggerer's `SELECT ... FOR UPDATE` pattern is intentional for claiming triggers. Reducing contention via `idle_in_transaction_session_timeout` on the DB side helps but doesn't eliminate the race window.
- Re-raising after N consecutive failures: adds complexity for limited benefit, since a persistent DB outage would surface through other signals (heartbeat failures, liveness probes) before the retry count mattered.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
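
For readers skimming the description, here is a rough sketch of the catch-and-log pattern the "Changes" section describes. It is not the actual diff. To stay runnable without a database, it defines a local `OperationalError` as a stand-in for `sqlalchemy.exc.OperationalError`, and `FlakyTriggerStore` and `clean_unused_best_effort` are hypothetical names invented for illustration only.

```python
import logging

log = logging.getLogger("triggerer_sketch")


class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError (assumption: keeps the sketch DB-free)."""


class FlakyTriggerStore:
    """Hypothetical stand-in for the Trigger model; its first cleanup call times out."""

    def __init__(self):
        self.calls = 0

    def clean_unused(self):
        self.calls += 1
        if self.calls == 1:
            # Simulates the DELETE being cancelled by statement_timeout.
            raise OperationalError("canceling statement due to statement timeout")


def clean_unused_best_effort(store):
    """Run one cleanup pass; return True on success, False if the pass was skipped."""
    try:
        store.clean_unused()
    except OperationalError:
        # Log and move on: the next heartbeat retries the cleanup.
        log.warning("Trigger cleanup failed; retrying on next heartbeat", exc_info=True)
        return False
    return True


store = FlakyTriggerStore()
# Three simulated heartbeats: the first cleanup is skipped, the later ones succeed.
results = [clean_unused_best_effort(store) for _ in range(3)]
```

Only the database-level error is swallowed in this sketch; any other exception type still propagates, which keeps genuine programming errors visible.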
