dimon222 opened a new issue, #56959:
URL: https://github.com/apache/airflow/issues/56959

   ### Apache Airflow version
   
   2.11.0
   
   ### If "Other Airflow 2/3 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   I recently switched to a standalone dag processor pod as part of the solution 
for #56294, but noticed new behavior: at a random time of day the dag processor 
hangs indefinitely with no subprocess activity (its set of 3 dag-parsing 
subprocesses is left making no progress). I tried forcefully killing a few of 
these dagfileprocessor threads to see if it recovers, but without success. The 
logs also stop producing any output. The metrics graph in k8s still shows a 
little CPU activity, but nowhere near the utilization seen when it works 
correctly.
   There are no exceptions or anything meaningful in the logs: everything works 
correctly until, at some moment, it stops logging and stops processing dags. 
   
   ### What you think should happen instead?
   
   DagProcessor should recycle subprocesses that have run beyond the 
configured import timeout, or be able to self-recover, or at least crash 
gracefully instead of freezing. 
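   
   To illustrate the kind of recycling meant here, below is a minimal sketch 
of a watchdog that kills child processes that have outlived a deadline. It is 
not Airflow's actual implementation: the names `Worker`, `start_worker` and 
`reap_stuck_workers` are hypothetical, and the children are plain Python 
subprocesses standing in for dag file processors.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical record for one supervised child process."""
    name: str
    proc: subprocess.Popen
    started_at: float  # time.monotonic() value taken at launch


def start_worker(name, code):
    """Launch a child Python interpreter running `code` (stand-in for a parser)."""
    proc = subprocess.Popen([sys.executable, "-c", code])
    return Worker(name=name, proc=proc, started_at=time.monotonic())


def reap_stuck_workers(workers, timeout_s, now=None):
    """Kill every still-running worker older than `timeout_s` seconds.

    Returns the names of the workers that were killed, so the caller can
    log them and start replacements.
    """
    now = time.monotonic() if now is None else now
    killed = []
    for worker in workers:
        still_running = worker.proc.poll() is None
        if still_running and now - worker.started_at > timeout_s:
            worker.proc.kill()
            worker.proc.wait()  # collect the exit status to avoid a zombie
            killed.append(worker.name)
    return killed
```

   A supervising loop would call `reap_stuck_workers` periodically and start a 
replacement for each name it returns. This only helps if the supervising loop 
itself keeps running, which may not be the case in the hang described above.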
   
   ### How to reproduce
   
   Unable to determine the root cause, so I cannot reproduce it consistently. 
   
   ### Operating System
   
   UBI9 (RHEL9)
   
   ### Versions of Apache Airflow Providers
   
   Latest as of constraints-3.12.txt for Airflow 2.11.0
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   OpenShift (k8s), pip virtualenv install of Airflow on Python 3.12 in a UBI9 
image. Celery executor.
   
   The stack is split across pods:   
   1. webserver pod.  
   2. celery scheduler pod.  
   3. celery worker pod.  
   4. dag processor pod.  
   5. postgres pod.  
   6. PVC (network mount) used to share dag catalog between airflow pods    
   
   ### Anything else?
   
   Seems to happen randomly, within 1-5 days after launch.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

