GitHub user potiuk edited a comment on the discussion: Need help understanding
total number of Dags oscillating on UI
> [2024-11-30T02:10:03.298+0000] {scheduler_job_runner.py:1782} INFO - Found
> (8) stales dags not parsed after 2024-11-30 02:00:03.296796+00:00.
That line is important. It tells you that the dag processor did not parse some
of your DAGs within the
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dag-stale-not-seen-duration
time (10 minutes) - the timestamps in your log file confirm that: 10 minutes
passed since parsing the DAG files last produced those 8 DAG ids, so they were
deactivated.
Now, the question is why - and you need to investigate several things to find
out.
1) First of all - find out which DAGs disappeared (got deactivated) - that will
make further investigation easier. Those are the rows in the `dag` table of the
Airflow metadata database that have `is_active = False`. This is what changed in
your case - all the DAGs that are non-active disappear from the UI.
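To make step 1 concrete, here is a hedged sketch of such a query. It assumes the Airflow 2.x metadata schema (table `dag`, boolean column `is_active`); sqlite3 is used only so the example is self-contained - point the same SQL at your real metadata database with the appropriate driver:

```python
# Hedged sketch: list deactivated DAG ids from Airflow's metadata DB.
# Assumes the 2.x schema (table `dag`, boolean column `is_active`).
# sqlite3 is used for illustration only - run the same SELECT against
# your actual metadata database (e.g. Postgres) instead.
import sqlite3

def list_inactive_dags(db_path: str) -> list[str]:
    """Return dag_ids whose is_active flag is false (hidden from the UI)."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT dag_id FROM dag WHERE is_active = 0 ORDER BY dag_id"
        )
        return [dag_id for (dag_id,) in rows]
```

The resulting list of dag_ids is your starting point for steps 2 and 3 below.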
2) The important thing here is that there is no 1-1 relationship between DAG
files and DAG ids. Sometimes - with Dynamic DAG Generation - parsing one DAG
file might produce more than one DAG id. So you need to find out which of the
FILES in your DAG folder should have produced the DAG ids that got deactivated.
3) Once you know that, there are a couple of things that could go wrong.
**a) Your Dynamic DAG generation is buggy / unstable and produces a different
set of DAG ids every time it is parsed.**
Generally speaking, when you do Dynamic DAG generation, every time a DAG file
is parsed it should consistently produce the same DAG ids. Again - it is
entirely up to you how the DAGs are written. There are different techniques you
can use for dynamic DAG generation
https://airflow.apache.org/docs/apache-airflow/stable/howto/dynamic-dag-generation.html#dynamic-dags-with-external-configuration-from-a-structured-data-file
- and it might be that the way you do it is simply unstable. For example, the
file pulls information about the generated DAGs from a JSON file whose content
changes in unpredictable / unstable ways. Or there is a bug in the DAG
generation code that sometimes raises an exception, so that not all DAGs get
generated. Generally - when you run `python your_dag_file.py` - parsing it
should consistently create the same number of DAG "objects" with the same ids
in the Python globals.
This is the most probable cause.
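The stable-generation rule above can be sketched minimally as follows. `DAG` is stubbed with a namedtuple here so the id logic runs standalone; in a real DAG file you would `from airflow import DAG`, and the source names are made up:

```python
# Minimal sketch of STABLE dynamic DAG generation.
from collections import namedtuple

DAG = namedtuple("DAG", ["dag_id"])  # stand-in for airflow.DAG

# Static, deterministic config -> the same DAG ids on every parse.
SOURCES = ["customers", "invoices", "orders"]

for name in sorted(SOURCES):  # sorted: stable order, no surprises
    dag_id = f"etl_{name}"
    # Airflow discovers DAGs through module globals, so each generated
    # DAG object must be assigned to a unique global name:
    globals()[dag_id] = DAG(dag_id=dag_id)

# An UNSTABLE variant would derive SOURCES from, e.g., an API call or a
# file that changes between parses - producing different ids each time,
# which is exactly what makes DAGs appear and disappear.
```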
**b) Some of your files are not parsed for some reason. Generally the Airflow
DAG file processor parses all files continuously and should go through all of
them in a loop, but there are certain parameters that control it:**
*
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#file-parsing-sort-mode
- determines the sorting criteria used.
*
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#parsing-cleanup-interval
- this is how often we check when DAGs were last parsed, to see if the
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dag-stale-not-seen-duration
has been exceeded.
*
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#parsing-processes
- this is how many parallel parsing processes are run by each scheduler (or
standalone dag file processor, if you have one).
There are a few other parameters, but those are the important ones.
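For reference, those knobs live in the `[scheduler]` section of `airflow.cfg` (or the matching `AIRFLOW__SCHEDULER__*` environment variables). A sketch with the values discussed in this thread - option names assumed from the stable configuration reference, so double-check against your Airflow version:

```ini
[scheduler]
# how files are ordered for parsing:
# modified_time | random_seeded_by_host | alphabetical
file_parsing_sort_mode = modified_time
# number of parallel parsing processes per scheduler / dag processor
parsing_processes = 2
# seconds after which an unparsed DAG is considered stale and deactivated
dag_stale_not_seen_duration = 600
```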
**c) Your DAGs do not follow best practices.**
One of the other options is that parsing some of your DAG files simply takes a
long time - long enough that it takes more than 10 minutes for the dag file
processor to go through the loop and parse all your files. If you follow best
practices
https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#reducing-dag-complexity
- and make sure that you do not block or spend a lot of time in the top-level
code of your DAG files
https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
- parsing even hundreds of files should take seconds at most. But there are
cases - especially when your top-level code reaches out to external sources and
possibly "hangs" or takes a long time to complete - where parsing can take
arbitrarily long: minutes or hours.
If some of your DAG files do that, they might simply hang for a long time - if
you have 2 parallel dag parsing processes running, it's enough to have two
files where parsing takes around 10 minutes, and that will cause some of the
remaining files to not be parsed "on time". Similarly, if you have many DAG
files that each take a few minutes to parse, that might delay the queue enough
that some of your DAGs will not be parsed within the default 10 minutes.
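A minimal sketch of the top-level-code difference (the URL and helper name are hypothetical - the point is only *where* the slow call happens):

```python
# ANTI-PATTERN: a network call in top-level code runs on EVERY parse
# loop, and a hang here ties up one of the parsing processes:
#
#   import requests
#   TABLES = requests.get("https://config.example.com/tables").json()
#
# BETTER: keep top-level code cheap; defer slow work to task run time.
def fetch_tables():
    """Runs when the task executes, not every time the file is parsed."""
    import requests  # hypothetical dependency, imported lazily
    return requests.get("https://config.example.com/tables").json()
```

With the slow call inside the callable, parsing the file only defines a function, which takes microseconds.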
You need to make sure to optimize your parsing time and follow best practices -
ideally, all your DAGs should be parsed in seconds rather than minutes. Of
course you can also play with the parameters above and make the timeouts
bigger, but that rather masks the "long parsing" problem than solves it, and
will result in, for example, far longer delays before DAG file changes are
reflected in the parsed DAGs in the DB.
You can control the DAG parsing timeout
https://airflow.apache.org/docs/apache-airflow/stable/faq.html#how-to-control-dag-file-parsing-timeout-for-different-dag-files
and the next section
https://airflow.apache.org/docs/apache-airflow/stable/faq.html#when-there-are-a-lot-1000-of-dag-files-how-to-speed-up-parsing-of-new-files
explains some of the ways you can attempt to speed up parsing.
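That FAQ entry describes a per-file timeout hook you can define in `airflow_local_settings.py` on the scheduler's PYTHONPATH. A hedged sketch - the path substring and the timeout values are made up for illustration:

```python
# Sketch of the per-file parsing timeout hook described in the FAQ:
# define this function in airflow_local_settings.py. The "slow_dynamic"
# substring and the numbers below are illustrative only.
def get_dagbag_import_timeout(dag_file_path: str) -> float:
    if "slow_dynamic" in dag_file_path:  # a known-slow dynamic DAG file
        return 240.0                     # give it extra time...
    return 30.0                          # ...keep the default tight elsewhere
```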
This is the second most probable cause for what you see.
**d) Your syncing process might have some on/off states where files are
appearing / disappearing after check-out.**
Git-sync works by checking out the latest commit and then swapping it with the
previous check-out via a symbolic link. Maybe there is something in that
process that causes it? Maybe, for example, there are some permission problems
that prevent the files from being parsed, etc.
Not very likely, though.
**e) It might be that some combination of the parameters above and your
synchronization settings causes bad sorting of the parsed files.**
It might simply be that the processors never get to your files - there might be
various reasons. For example, if your sort order is `modified time` and for
some reason your syncing process causes mtime to be bumped continuously, it
might well be that the dag processor will only ever attempt to parse the "last
modified" files, and your "non-modified" files will always be put at the end of
the queue - that can cause similar issues. You must look for an indication (in
dag file processor logs) of which files are actually being parsed. This might
also happen if your filesystem handles mtime badly, or when your sync is
configured to not preserve the modification time coming from git.
This is also quite an unlikely reason, but if you use non-POSIX-compliant
shared filesystems, I can imagine it happening.
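A toy illustration of that starvation effect (file names and mtimes are made up) - a pure-Python model of a newest-first parse queue:

```python
# Toy model of a `modified time`-sorted parse queue: newest-mtime files
# are parsed first, so a file whose mtime keeps getting bumped by the
# sync process permanently crowds the others toward the end of the queue.
mtimes = {
    "constantly_resynced.py": 1_700_000_900,  # sync bumps this every loop
    "normal_dag_a.py": 1_700_000_100,
    "normal_dag_b.py": 1_700_000_050,
}
parse_queue = sorted(mtimes, key=mtimes.get, reverse=True)
print(parse_queue[0])  # the re-synced file always jumps to the front
```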
**f) Finally - the clocks on your various machines might not be synchronized.**
If one of the machines (DB, scheduler, processor, git) has significant time
drift, it might cause various time calculations to be wrong.
Not very likely - most computing resources out there use NTP or a similar way
of syncing time - but we've seen it happen.
Good luck with your investigation and please come back here with the results.
GitHub link:
https://github.com/apache/airflow/discussions/44495#discussioncomment-11421723