IggeeTianci opened a new issue, #68101:
URL: https://github.com/apache/airflow/issues/68101

   ### Under which category would you file this issue?
   
   Airflow Core
   
   ### Apache Airflow version
   
   3.2.0
   
   ### What happened and how to reproduce it?
   
   What happened:
   
   The dag-processor silently hangs on startup when its initial DB queries
   (deactivate_deleted_dags(), _scan_stale_dags()) encounter row-level lock
   contention. The process logs "Found N files for bundle dags-folder" and then
   produces no further output: no workers are spawned, no heartbeat is emitted,
   and no error is logged. The liveness probe eventually kills it, but this
   creates a self-reinforcing crash loop, because SIGKILL leaves the in-flight
   transaction as an "idle in transaction" session on PostgreSQL, which
   continues holding the lock that blocks the next restart.
   
   What I think should happen instead:
   
   1. DB queries in the startup path (_refresh_dag_bundles →
      deactivate_deleted_dags, deactivate_stale_dags) should set a
      statement_timeout or lock_timeout, or use SELECT ... FOR UPDATE NOWAIT,
      so they fail fast rather than blocking indefinitely.
   2. If a lock cannot be acquired within a reasonable timeout (e.g., 30s),
      the processor should log a clear warning (e.g., "Cannot acquire lock on
      dag table: possible zombie session holding locks").
   3. The heartbeat() call in _run_parsing_loop should happen on a background
      thread or before blocking DB operations, so the liveness probe doesn't
      kill a processor that is simply waiting on a lock. Currently, heartbeat
      only fires once per loop iteration; if the loop is blocked on a DB query,
      it never heartbeats.
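
To make point 3 concrete, here is a minimal sketch of a heartbeat driven by a background timer thread. It is not taken from manager.py: the class name BackgroundHeartbeat and the beat callable are hypothetical, and a real implementation would presumably need its own DB session for the heartbeat write rather than sharing the one that is blocked.

```python
import logging
import threading
from typing import Callable

log = logging.getLogger(__name__)


class BackgroundHeartbeat:
    """Hypothetical helper: call `beat` every `interval` seconds on a daemon thread."""

    def __init__(self, beat: Callable[[], None], interval: float) -> None:
        self._beat = beat
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._run, name="heartbeat", daemon=True
        )

    def _run(self) -> None:
        # Event.wait() returns True once stop is requested, which ends the loop.
        while not self._stop.wait(self._interval):
            try:
                self._beat()
            except Exception:
                # A failed heartbeat is logged but must not end the timer thread.
                log.exception("heartbeat failed")

    def __enter__(self) -> "BackgroundHeartbeat":
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self._stop.set()
        self._thread.join()
```

The blocking startup work would then run inside `with BackgroundHeartbeat(beat, interval):`. One trade-off to weigh: a heartbeat that is decoupled from the main loop also keeps reporting "alive" when the loop is genuinely stuck, so it probably only makes sense together with a lock or statement timeout.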
   
     How to reproduce:
   
     1. Deploy Airflow 3 with dag_file_processor_timeout < DB lock hold time
     2. Have a DAG that produces non-deterministic serialization (e.g., fetches 
a Variable at parse time that changes frequently), causing serialized_dag and 
dag_version tables to grow
     3. Kill the dag-processor with SIGKILL while it's mid-transaction on 
dag_version
     4. The orphaned PostgreSQL session holds locks; the next dag-processor 
instance blocks silently on startup
   
   
     Workaround:
   
     - Manually terminate zombie sessions via SELECT pg_terminate_backend(pid) 
FROM pg_stat_activity WHERE state = 'idle in transaction'
     - Set idle_in_transaction_session_timeout to a short value (e.g., 5 
minutes) in the PostgreSQL/Aurora parameter group
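
A narrower variant of the first workaround, as a rough sketch: it only terminates sessions that have been idle in transaction for longer than a threshold and skips the current connection. The function name terminate_stale_sessions is hypothetical, the cursor is assumed to be a DB-API cursor using the %s parameter style (as psycopg does), and the 300-second default is just an example value.

```python
# pg_stat_activity.state_change records when the session last changed state,
# so it approximates how long the session has been idle in transaction.
TERMINATE_STALE_SQL = """
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - make_interval(secs => %s)
  AND pid <> pg_backend_pid()
"""


def terminate_stale_sessions(cursor, min_idle_seconds: float = 300.0) -> list[int]:
    """Terminate long-idle transactions; return the pids that were signalled."""
    cursor.execute(TERMINATE_STALE_SQL, (min_idle_seconds,))
    return [pid for pid, terminated in cursor.fetchall() if terminated]
```

Note that pg_terminate_backend requires sufficient privileges (superuser, the same role as the target session, or membership in pg_signal_backend), and the query above still matches sessions from any application, so in a shared database it may need an extra filter.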
   
     Suggested fix locations:
   
     - airflow-core/src/airflow/dag_processing/manager.py — 
_refresh_dag_bundles() (line ~650–720) and _run_parsing_loop() (line ~880)
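     - Consider wrapping the startup DB operations with
session.execute(text("SET LOCAL lock_timeout = '30s'")), or applying a
per-statement timeout at the driver level (execution_options(timeout=30) does
not appear to be one of SQLAlchemy's documented execution options, so that
route would need verifying)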
     - Consider moving heartbeat to a background timer so it's not gated on the 
main loop completing
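
As a rough illustration of the lock_timeout suggestion, the sketch below sets a transaction-local lock timeout before a statement and recognises the resulting error. It is not the existing manager.py code: the function names are hypothetical, and it assumes a plain DB-API cursor with the %s parameter style instead of a SQLAlchemy session. It uses set_config(..., true), which has the same transaction-local effect as SET LOCAL but accepts a bound parameter.

```python
# SQLSTATE that PostgreSQL reports when lock_timeout expires (lock_not_available).
LOCK_NOT_AVAILABLE = "55P03"


def set_local_lock_timeout(cursor, timeout: str = "30s") -> None:
    """Apply lock_timeout for the current transaction only."""
    # The third argument (is_local = true) makes this behave like SET LOCAL.
    cursor.execute("SELECT set_config('lock_timeout', %s, true)", (timeout,))


def run_with_lock_timeout(cursor, statement: str, params=(), timeout: str = "30s") -> None:
    """Run one statement so that it fails fast instead of waiting on a lock."""
    set_local_lock_timeout(cursor, timeout)
    cursor.execute(statement, params)


def is_lock_timeout(exc: BaseException) -> bool:
    """Best-effort check: psycopg2 exposes `pgcode`, psycopg 3 exposes `sqlstate`."""
    code = getattr(exc, "pgcode", None) or getattr(exc, "sqlstate", None)
    return code == LOCK_NOT_AVAILABLE
```

A caller could catch the driver's error around run_with_lock_timeout, check is_lock_timeout(exc), and log the warning proposed above before retrying or exiting. Because the setting is transaction-local, it has to be issued inside the same transaction as the statement it is meant to protect.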
   
     Operating environment:
   
     - Airflow 3.2.0
     - Aurora PostgreSQL 16 (serverless v2)
     - KubernetesExecutor, dedicated dag-processor deployment
     - Liveness probe: airflow jobs check --local --job-type DagProcessorJob 
with periodSeconds: 90, failureThreshold: 6
   
   ### What you think should happen instead?
   
   _No response_
   
   ### Operating System
   
   _No response_
   
   ### Deployment
   
   None
   
   ### Apache Airflow Provider(s)
   
   _No response_
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Official Helm Chart version
   
   Not Applicable
   
   ### Kubernetes Version
   
   _No response_
   
   ### Helm Chart configuration
   
   _No response_
   
   ### Docker Image customizations
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

