Hello Airflow users,
We just upgraded from Airflow 1.10.10 to Airflow 2.2.5. We are using a standard
installation with the Celery executor: one master node running the webserver,
scheduler, and Flower, plus 4 worker nodes. We are using hosted MySQL 8, Redis,
and Python 3.6.10.
We have around 2300 DAGs. With version 1.10.10 the scheduler was able to
process all 2300 DAGs, although not efficiently. With version 2.2.5, the
scheduler worked fine with 519 DAGs. We then added ~300 DAGs, and that's when
the scheduler started logging the error below:
2022-04-06 01:44:39,039 ERROR - DagFileProcessorManager (PID=9876) last sent a heartbeat 50.59 seconds ago! Restarting it
2022-04-06 01:44:39,067 INFO - Sending Signals.SIGTERM to group 9876. PIDs of all processes in the group: [9876]
2022-04-06 01:44:39,067 INFO - Sending the signal Signals.SIGTERM to group 9876
2022-04-06 01:44:39,320 INFO - Process psutil.Process(pid=9876, status='terminated', exitcode=0, started='01:43:47') (9876) terminated with exit code 0
2022-04-06 01:44:39,327 INFO - Launched DagFileProcessorManager with pid: 9988
2022-04-06 01:44:39,344 INFO - Configured default timezone Timezone('UTC')
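To quantify how often the manager gets restarted, a small script along these
lines could scan the scheduler log. This is only a sketch: it assumes each log
message occupies a single line in the actual log file, the regular expression
is derived from the one error message quoted in this post, and the names
find_manager_restarts and SAMPLE are made up for illustration.

```python
import re

# Pattern derived from the single error line quoted in this post (an assumption:
# other Airflow versions may word the message differently).
HEARTBEAT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ERROR - "
    r"DagFileProcessorManager \(PID=(?P<pid>\d+)\) last sent a heartbeat "
    r"(?P<age>\d+(?:\.\d+)?) seconds ago"
)


def find_manager_restarts(log_text):
    """Return (timestamp, pid, heartbeat_age_seconds) for each restart message."""
    restarts = []
    for line in log_text.splitlines():
        match = HEARTBEAT_RE.match(line)
        if match:
            restarts.append(
                (match.group("ts"), int(match.group("pid")), float(match.group("age")))
            )
    return restarts


# Two lines copied from the log in this post, unwrapped to one message per line.
SAMPLE = (
    "2022-04-06 01:44:39,039 ERROR - DagFileProcessorManager (PID=9876) "
    "last sent a heartbeat 50.59 seconds ago! Restarting it\n"
    "2022-04-06 01:44:39,327 INFO - Launched DagFileProcessorManager with pid: 9988\n"
)

if __name__ == "__main__":
    for ts, pid, age in find_manager_restarts(SAMPLE):
        print(f"{ts} pid={pid} heartbeat_age={age:.2f}s")
```

Pointing the function at the full scheduler log instead of SAMPLE would show
whether the restarts happen at a steady interval or only in bursts.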
We started a second scheduler on one of the worker nodes, thinking it would
help with the load, but that did not make a difference. Both schedulers
returned the same error message as above.
More than an hour after the schedulers started, some DAGs were processed
sporadically, but the rest of the time the logs showed nothing but
DagFileProcessorManager error messages.
I came across this post
https://github.com/apache/airflow/discussions/19270 that suggested increasing
the value of scheduler_health_check_threshold. I changed it to 120, but that
did not solve the problem.
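For anyone trying to reproduce the setting, here is a minimal sketch of how
the option name maps to an environment variable override, assuming Airflow's
AIRFLOW__{SECTION}__{KEY} naming convention and that the option lives in the
[scheduler] section of airflow.cfg. The helper name airflow_env_override is
hypothetical, and 120 is simply the value mentioned above.

```python
def airflow_env_override(section, key, value):
    """Build a NAME=value string for overriding an airflow.cfg option.

    Assumes the AIRFLOW__{SECTION}__{KEY} convention (double underscores,
    upper case) for environment variable overrides.
    """
    name = f"AIRFLOW__{section.upper()}__{key.upper()}"
    return f"{name}={value}"


# Equivalent airflow.cfg entry (assumed to be under the [scheduler] section):
#   [scheduler]
#   scheduler_health_check_threshold = 120
if __name__ == "__main__":
    print(airflow_env_override("scheduler", "scheduler_health_check_threshold", 120))
```

The variable would need to be set in the environment of the scheduler process
itself before it starts for the override to take effect.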
Any suggestions on how to fix this issue, or should we possibly downgrade to a
different version?
Thanks,
-mo