argibbs commented on issue #34339:
URL: https://github.com/apache/airflow/issues/34339#issuecomment-1717644586

   As is always the way: after sitting on this problem for days before raising 
the issue, I have just noticed that we are getting occasional timeouts in the 
dag processor manager on the dags that are most frequently exhibiting the 
problem. (Why we're hitting the timeout is a separate problem, but baby steps)
   
   After dropping down to a single scheduler, this was manifesting as the dags 
dropping out of the GUI and then reappearing, which is how I discovered it. 
(Aside: given how critical the dag processor manager's core loop is to Airflow 
reliability, I feel like it gets nowhere near as much error reporting as it 
should. Really, the GUI should be flagging up process timeouts.)
   
   When running with multiple schedulers, we never noticed this flickering in 
and out of existence in the GUI. Total guess, but maybe this was because there 
was always at least one of the schedulers which had recently processed the dag 
ok... 🤷 
   
   My working hypothesis is now that a scheduler would time out while processing 
the dags, and this would somehow cause all the active tasks in the affected dags 
to be blatted as failed. (Insert suitable jazz hand waving over the specifics.) 
I checked a few of the failures I've seen, and I do see timeouts in the 
processor at roughly the same time that the Gantt chart shows the tasks being 
blatted as failed, despite them actually running OK.
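
   A rough way to sanity-check that correlation is to pair each task failure 
with the nearest processor timeout and keep only the pairs that fall within a 
small window. This is a minimal sketch, not anything Airflow provides: the 
function name, the two-minute window, and the sample timestamps are all 
hypothetical, and the timestamps would have to be pulled from the processor 
logs and task instances by hand.

```python
from datetime import datetime, timedelta


def failures_near_timeouts(timeouts, failures, window=timedelta(minutes=2)):
    """Pair each failure with its nearest timeout, if within `window`.

    Both arguments are lists of datetime objects (hypothetical inputs,
    collected manually from logs and the task instance view).
    """
    matches = []
    for failed_at in failures:
        # Nearest timeout by absolute time difference; None if there are none.
        nearest = min(timeouts, key=lambda t: abs(t - failed_at), default=None)
        if nearest is not None and abs(nearest - failed_at) <= window:
            matches.append((failed_at, nearest))
    return matches


# Made-up timestamps, purely for illustration.
timeouts = [datetime(2000, 1, 1, 10, 0, 5)]
failures = [datetime(2000, 1, 1, 10, 0, 50), datetime(2000, 1, 1, 14, 30, 0)]

for failed_at, timed_out_at in failures_near_timeouts(timeouts, failures):
    print(f"failure at {failed_at} is close to timeout at {timed_out_at}")
```

   Failures with no nearby timeout would count against the hypothesis, so the 
unmatched ones are worth looking at too.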
   
   Anyhoo, I am now running two experiments:
   1. Multiple schedulers + increased timeout.
   2. Single scheduler + increased timeout.
   
   Note:
   Obviously, the third experiment is:
   3. Single scheduler + default timeout.
   
   I have been running this (a single scheduler + default timeout) for several 
days in one env, and the problem seemed to have gone away (which is why I was 
suspicious of multiple schedulers). However, I have just checked the dag 
processor logs for that env, and it seems not to have been experiencing 
timeouts at all, so it's possible I simply picked a less contended box for that 
sole scheduler. Or maybe timeouts + single scheduler is fine, and it's timeouts 
+ multiple schedulers that's the problem. Or maybe I'm chasing a red herring.
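
   For comparing envs, a quick tally of timeout lines per dag file in the 
processor logs might help. The sketch below is an assumption-heavy 
illustration: the regex, the function name, and the sample log lines are 
hypothetical and do not reproduce Airflow's actual log format, so the pattern 
would need adjusting to whatever the real logs contain.

```python
import re
from collections import Counter

# Assumed wording; adjust to match the real timeout message in your logs.
TIMEOUT_PATTERN = re.compile(r"timed out|timeout", re.IGNORECASE)
# Assumes the dag file path appears on the same line and ends in ".py".
DAG_FILE_PATTERN = re.compile(r"(\S+\.py)")


def count_timeouts_by_file(log_lines):
    """Count log lines mentioning a timeout, grouped by dag file path."""
    counts = Counter()
    for line in log_lines:
        if TIMEOUT_PATTERN.search(line):
            match = DAG_FILE_PATTERN.search(line)
            counts[match.group(1) if match else "<unknown>"] += 1
    return counts


# Hypothetical log lines, not real Airflow output.
sample_lines = [
    "ERROR processor for /dags/a.py has timed out",
    "INFO finished processing /dags/b.py",
    "ERROR timeout while parsing /dags/a.py",
]

for dag_file, n in count_timeouts_by_file(sample_lines).most_common():
    print(dag_file, n)
```

   Running the same tally against each env's logs would show whether the 
"quiet" env really had no timeouts or just fewer of them.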
   
   I'll update if I find more.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
