potiuk commented on issue #27100:
URL: https://github.com/apache/airflow/issues/27100#issuecomment-1296472304

   > I've seen tasks getting stuck silently inside the `airflow db check`
command, which is part of the entrypoint of the Airflow Docker container. The
entrypoint itself has a retry loop (CONNECTION_CHECK_MAX_COUNT, set to 20), and
that count gets multiplied by your connect timeout, which can be very long by
default, maybe even infinite? I've seen examples where it gets stuck hanging
here for hours, even after the DB has recovered.
   
   Ideas on other strategies? What have you seen working for you, @hterik? I
think we can improve that. The current defaults were taken from the original
Astronomer image, but maybe we can do better? WDYT?
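   One possible strategy, sketched under assumptions: bound the *total* wait with a single deadline instead of relying on (attempt count) x (per-attempt connect timeout), which is unbounded when the connect timeout is very long or infinite. The functions `worst_case_wait` and `wait_for_db` and the `check(timeout)` callable are hypothetical illustrations, not the entrypoint's actual code, and the numbers in the usage note are made up.

```python
import time
from typing import Callable


def worst_case_wait(max_count: int, connect_timeout_s: float, sleep_s: float) -> float:
    """Upper bound for a plain retry loop: every attempt may block for the full timeout."""
    return max_count * (connect_timeout_s + sleep_s)


def wait_for_db(
    check: Callable[[float], bool],
    total_deadline_s: float,
    per_attempt_timeout_s: float,
    sleep_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleeper: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry `check` until it succeeds or an overall deadline passes.

    `check(timeout)` is a hypothetical callable that attempts one DB
    connection and must give up after `timeout` seconds.
    """
    deadline = clock() + total_deadline_s
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # A single attempt is never allowed to outlive the overall deadline.
        if check(min(per_attempt_timeout_s, remaining)):
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Sleep between attempts, but not past the deadline either.
        sleeper(min(sleep_s, remaining))
```

   For example, with 20 attempts, a hypothetical 30 s connect timeout and a 3 s sleep, `worst_case_wait(20, 30, 3)` gives 660 s, and with an infinite timeout there is no bound at all, whereas `wait_for_db` gives up after `total_deadline_s` regardless of the driver's timeout (as long as `check` honours the timeout it is passed).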
   
   > Another problem with the scheduler is that if one of the threads inside
crashes, the process still keeps running. You need to monitor the scheduler
heartbeat externally and restart the scheduler whenever it becomes
unhealthy. This became a lot easier in 2.4, which now has a dedicated
health probe for the scheduler. If this is the problem, it should be visible as a
banner at the top of the web page.
   
   This is interesting and should not (generally) happen. Do you have an
example of that, @hterik? IMHO that is exactly my point about "crashing
hard whenever any crash occurred". Seeing examples of when it happened would be
super helpful (for reproduction and a fix).
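   The external heartbeat monitoring described in the quote above boils down to a staleness check that runs outside the scheduler process, so a crashed thread inside the scheduler cannot silence it. A minimal sketch, assuming the latest heartbeat timestamp is obtained somehow (how it is read is left out) and using an arbitrary threshold; `scheduler_is_healthy` and `probe_exit_code` are hypothetical names, not Airflow APIs.

```python
from datetime import datetime, timedelta
from typing import Optional


def scheduler_is_healthy(
    latest_heartbeat: Optional[datetime], now: datetime, threshold: timedelta
) -> bool:
    """Healthy only if a heartbeat exists and is not older than `threshold`."""
    if latest_heartbeat is None:
        return False
    return now - latest_heartbeat <= threshold


def probe_exit_code(
    latest_heartbeat: Optional[datetime], now: datetime, threshold: timedelta
) -> int:
    """Exit code for a liveness-style probe: 0 = healthy, 1 = restart me."""
    return 0 if scheduler_is_healthy(latest_heartbeat, now, threshold) else 1
```

   Whatever runs the probe (for example an orchestrator's liveness check) is then responsible for restarting the scheduler when the exit code is non-zero.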
   

