AmosG opened a new pull request, #59167:
URL: https://github.com/apache/airflow/pull/59167

   Fix DAG processor crash on MySQL connection failure during import error recording
   
   Fixes #59166
   
   The DAG processor was crashing when MySQL connection failures occurred while
   recording DAG import errors to the database. The root cause was missing
   `session.rollback()` calls after caught exceptions, which left the SQLAlchemy
   session in an invalid state. A subsequent `session.flush()` call then raised
   a new exception that was not caught, causing the DAG processor to crash and
   enter a restart loop.
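
The failure mode can be illustrated with a toy model. `ToySession` below is a hypothetical stand-in, not SQLAlchemy and not code from this PR; it only mimics the relevant behavior, namely that a session whose operation failed rejects further work until `rollback()` is called. The function names are also made up for illustration.

```python
class ToySession:
    """Hypothetical stand-in for a database session (not the SQLAlchemy API)."""

    def __init__(self):
        self.needs_rollback = False

    def execute(self, fail=False):
        if self.needs_rollback:
            raise RuntimeError("session is invalid until rollback() is called")
        if fail:
            # Simulate a lost database connection in the middle of a statement.
            self.needs_rollback = True
            raise ConnectionError("simulated connection failure")

    def flush(self):
        if self.needs_rollback:
            raise RuntimeError("session is invalid until rollback() is called")

    def rollback(self):
        self.needs_rollback = False


def record_without_rollback(session):
    try:
        session.execute(fail=True)
    except ConnectionError:
        pass  # the error is swallowed, but the session is left invalid
    session.flush()  # raises a second exception that nothing catches


def record_with_rollback(session):
    try:
        session.execute(fail=True)
    except ConnectionError:
        session.rollback()  # reset the session before reusing it
    session.flush()  # the session is usable again
```

In this model, `record_without_rollback` lets a second exception escape from `flush()`, which corresponds to the crash described above, while `record_with_rollback` returns normally.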
   
   This issue was observed in production, where the DAG processor restarted
   1,259 times in 4 days (~13 restarts/hour), leading to:
   - Connection pool exhaustion
   - Cascading failures across Airflow components
   - Import errors not being recorded in the UI
   - System instability
   
   ## Changes
   
   - Add `session.rollback()` after caught exceptions in `_update_import_errors()`
   - Add `session.rollback()` after caught exceptions in `_update_dag_warnings()`
   - Wrap `session.flush()` in try-except with `session.rollback()` on failure
   - Add comprehensive unit tests for all failure scenarios
   - Update comments to clarify error handling behavior
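
The error handling described in the list above can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: `update_parsing_results` and its two callable parameters are hypothetical names, and the real `_update_import_errors()` and `_update_dag_warnings()` take different arguments.

```python
import logging

log = logging.getLogger(__name__)


def update_parsing_results(session, update_import_errors, update_dag_warnings):
    """Run each update step; roll back and continue if a step fails."""
    for step in (update_import_errors, update_dag_warnings):
        try:
            step(session)
        except Exception:
            log.exception("Error while running an update step")
            session.rollback()  # return the session to a usable state

    try:
        session.flush()
    except Exception:
        log.exception("Error while flushing the session")
        session.rollback()  # do not let a flush failure propagate
```

Each step is guarded separately so that a failure while recording import errors does not prevent the DAG warnings update from running, and a failing `flush()` is logged rather than raised to the caller.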
   
   ## Testing
   
   Added 5 new unit tests in the `TestDagProcessorCrashFix` class:
   - `test_update_dag_parsing_results_handles_db_failure_gracefully`
   - `test_update_dag_parsing_results_handles_dag_warnings_db_failure_gracefully`
   - `test_update_dag_parsing_results_handles_session_flush_failure_gracefully`
   - `test_session_rollback_called_on_import_errors_failure`
   - `test_session_rollback_called_on_dag_warnings_failure`
   
   All tests pass and verify that:
   1. Database failures don't crash the DAG processor
   2. `session.rollback()` is called correctly on failures
   3. The processor continues gracefully after errors
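
A test of this kind might look roughly like the sketch below. It is not one of the tests added in this PR: `record_import_errors` is a hypothetical, minimal stand-in for the code under test, and the session is a `unittest.mock.MagicMock` rather than a real SQLAlchemy session.

```python
from unittest import mock


def record_import_errors(session, write_errors):
    """Hypothetical stand-in for the code under test."""
    try:
        write_errors(session)
    except Exception:
        session.rollback()


def test_rollback_called_on_db_failure():
    session = mock.MagicMock()
    write_errors = mock.Mock(
        side_effect=ConnectionError("simulated MySQL connection failure")
    )

    # Must not raise: the failure is handled inside the function.
    record_import_errors(session, write_errors)

    write_errors.assert_called_once_with(session)
    session.rollback.assert_called_once()


def test_no_rollback_on_success():
    session = mock.MagicMock()

    record_import_errors(session, mock.Mock())

    session.rollback.assert_not_called()
```

Setting `side_effect` to an exception instance makes the mock raise it when called, so the database failure can be simulated without a running MySQL server.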
   
   ## Impact
   
   The fix ensures the DAG processor gracefully handles database connection
   failures and continues processing other DAGs instead of crashing, preventing
   production outages from restart loops.
   