GitHub user potiuk edited a comment on the discussion: Need help understanding 
total number of Dags oscillating on UI

> [2024-11-30T02:10:03.298+0000] {scheduler_job_runner.py:1782} INFO - Found 
> (8) stales dags not parsed after 2024-11-30 02:00:03.296796+00:00.

That line is important. It tells you that the DAG processor did not parse some of your DAGs within the
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dag-stale-not-seen-duration
time (10 minutes) - the timestamps in your log confirm that: 10 minutes passed since parsing of the DAG files last produced those 8 DAG ids, so they were deactivated.

Now, the question is why, and we need you to investigate several things to find out.

1) First of all - find out which DAGs disappeared (got deactivated) - that will make further investigation easier. Those are rows in the `dag` table of the Airflow metadata DB that have `is_active = False`. This is what changes in your case - all the DAGs that are not active disappear from the UI.
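
For example, a minimal sketch (run where Airflow and its metadata DB are reachable) that lists the deactivated DAG ids via the ORM:

```python
# Sketch: list the DAG ids that Airflow has deactivated.
from airflow.models import DagModel
from airflow.utils.session import create_session

with create_session() as session:
    for (dag_id,) in session.query(DagModel.dag_id).filter(DagModel.is_active.is_(False)):
        print(dag_id)
```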

2) The important thing here is that there is no 1-1 relationship between DAG files and DAG ids. With Dynamic DAG Generation, parsing one DAG file might produce more than one DAG id. So you need to find out which of the FILES in your DAG folder should have produced the DAG ids that got deactivated.
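
Continuing the sketch above - `DagModel.fileloc` records which file produced each DAG id, so you can group the deactivated ids by file:

```python
# Sketch: group deactivated DAG ids by the file that produced them.
from collections import defaultdict

from airflow.models import DagModel
from airflow.utils.session import create_session

by_file = defaultdict(list)
with create_session() as session:
    for dag in session.query(DagModel).filter(DagModel.is_active.is_(False)):
        by_file[dag.fileloc].append(dag.dag_id)

for fileloc, dag_ids in sorted(by_file.items()):
    print(f"{fileloc}: {dag_ids}")
```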

3) Once you know that, there are several things that could have gone wrong.

**a) Your Dynamic DAG generation is buggy / unstable and produces a different set of DAG ids every time the file is parsed.**

Generally speaking, when you do Dynamic DAG generation, every time a DAG file is parsed it should consistently produce the same DAG ids. Again - it is entirely up to you how the DAGs are written. There are different techniques you can use for dynamic DAG generation
https://airflow.apache.org/docs/apache-airflow/stable/howto/dynamic-dag-generation.html#dynamic-dags-with-external-configuration-from-a-structured-data-file
- and it might be that the way you do it is simply unstable. For example, your code pulls information about the DAGs to generate from a JSON file whose content changes in unpredictable / unstable ways. Or there is a bug in the DAG generation code that sometimes raises an exception, so not all DAGs are generated. Generally, running `python your_dag_file.py` should consistently create the same number of DAG objects in the module's globals, with the same ids.
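
A minimal sketch of a stable pattern (the `dags_config.json` file and its structure are made up for illustration; assumes Airflow 2.4+ for the `schedule` parameter) - the point is that the set of generated DAG ids is fully determined by the config file, and a broken config fails loudly rather than silently producing fewer DAGs:

```python
# Sketch: deterministic dynamic DAG generation from a local JSON file.
import json
from pathlib import Path

import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Read the config next to the DAG file - no network calls at parse time.
config = json.loads((Path(__file__).parent / "dags_config.json").read_text())

for entry in config["dags"]:
    dag_id = f"generated_{entry['name']}"
    with DAG(
        dag_id=dag_id,
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        schedule=entry.get("schedule"),
    ) as dag:
        EmptyOperator(task_id="placeholder")
    # The DAG object must land in the module's globals to be discovered.
    globals()[dag_id] = dag
```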

This is the most probable cause.

**b) Some of your files are not parsed for some reason. Generally, the Airflow DAG file processor parses all files continuously and should go through all of them in a loop, but there are certain parameters that control it:**

* https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#file-parsing-sort-mode - determines the sorting criteria used for the parsing queue.
* https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#parsing-cleanup-interval - how often we check when each DAG was last parsed, to see if https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dag-stale-not-seen-duration has been exceeded.
* https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#parsing-processes - how many parallel parsing processes are run by each scheduler (or by the standalone DAG file processor, if you have one).

There are a few other parameters, but those are the important ones.
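
For reference, these live in the `[scheduler]` section of `airflow.cfg`; the values below are illustrative (they match the documented defaults at the time of writing), not recommendations:

```ini
[scheduler]
# Sorting criteria for files in the parsing queue.
file_parsing_sort_mode = modified_time
# Deactivate DAG ids not produced by parsing for this long (seconds).
dag_stale_not_seen_duration = 600
# Parallel parsing processes per scheduler / DAG file processor.
parsing_processes = 2
```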

**c) Your DAGs do not follow best practices.**

Another option is that parsing some of your DAG files simply takes a long time - long enough that the DAG file processor needs more than 10 minutes to complete a loop over all your files. If you follow best practices
https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#reducing-dag-complexity
- and make sure that you do not block or spend a lot of time in the top-level code of your DAG files
https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
- parsing even hundreds of files should take seconds at most. But there are cases - especially when your top-level code reaches out to external sources and possibly "hangs" or takes a long time to complete - where parsing can take arbitrarily long: minutes or even hours.

If some of your DAG files do that, they might simply hang for a long time. With 2 parallel DAG parsing processes running, it's enough to have two files that each take around 10 minutes to parse, and some of the remaining files will not be parsed "on time". Similarly, if you have many DAG files that each take a few minutes to parse, that might delay the queue enough that some of your DAGs are not parsed within the default 10 minutes.
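
A minimal sketch of the anti-pattern and the fix (the URL and names are illustrative):

```python
# Anti-pattern: top-level code that runs on EVERY parse of this file.
# If the endpoint is slow or hangs, so does DAG parsing:
#
#   import requests
#   config = requests.get("https://example.com/config").json()
#
# Better: defer the slow work into a task, so parsing stays fast.
import pendulum
from airflow import DAG
from airflow.decorators import task

with DAG(
    dag_id="deferred_fetch_example",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
) as dag:

    @task
    def fetch_config() -> dict:
        # The network call happens only at task run time, not at parse time.
        import requests

        return requests.get("https://example.com/config", timeout=30).json()

    fetch_config()
```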

You need to optimize your parsing time and follow best practices - ideally, all your DAGs should be parsed in seconds rather than minutes. Of course, you can also play with the parameters above and make the timeouts bigger, but that rather masks the "long parsing" problem than solves it, and it will result in, for example, far longer delays before DAG file changes are reflected in the parsed DAGs in the DB.

You can control the DAG parsing timeout
https://airflow.apache.org/docs/apache-airflow/stable/faq.html#how-to-control-dag-file-parsing-timeout-for-different-dag-files
and the next section
https://airflow.apache.org/docs/apache-airflow/stable/faq.html#when-there-are-a-lot-1000-of-dag-files-how-to-speed-up-parsing-of-new-files
explains some ways you can attempt to speed up parsing.
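
For example, the per-file timeout hook described in that FAQ entry is a function you define in `airflow_local_settings.py`; the `slow_` filename convention below is made up for illustration:

```python
# airflow_local_settings.py - sketch of a per-file parsing timeout (seconds).
import os


def get_dagbag_import_timeout(dag_file_path: str) -> float:
    # Hypothetical convention: give known-slow files a bigger budget.
    if os.path.basename(dag_file_path).startswith("slow_"):
        return 180.0
    return 30.0
```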

This is the second most probable cause of what you see.

**d) Your syncing process might have some on/off states where files appear / disappear after check-out.**

Git-sync works by checking out the latest commit and then swapping it in over the previous check-out via a symbolic link. Maybe there is something in that process that causes it? Maybe, for example, there are permission problems that prevent the files from being parsed, etc.

Not very likely, though.

**e) It might be that some combination of the parameters above and your synchronization settings causes bad sorting of the parsed files.**

It might simply be that the processors never get to parsing your files - there can be various reasons. For example, if your sorting order is `modified_time` and your syncing process causes mtime to change continuously for some reason, it might well be that the DAG processor only ever attempts to parse the "last modified" files, while your "non-modified" files are always put at the end of the queue - that could cause similar issues. You should look for an indication (in the DAG file processor logs) of which files are being parsed. This might also happen if your filesystem handles mtime badly, or if your sync is configured to not preserve the modification time of checked-out files.
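
If you suspect this, a quick sketch (the DAG folder path is an assumption) that prints files in the order `modified_time` sorting would favour - run it a few times across a sync cycle to see whether the mtimes keep changing:

```python
# Sketch: list DAG files newest-mtime-first, the order `modified_time` prefers.
import time
from pathlib import Path

DAG_FOLDER = Path("/opt/airflow/dags")  # adjust to your deployment

for f in sorted(DAG_FOLDER.rglob("*.py"), key=lambda p: p.stat().st_mtime, reverse=True):
    mtime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(f.stat().st_mtime))
    print(mtime, f)
```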

This is also quite an unlikely reason, but if you use non-POSIX-compliant shared filesystems, I can imagine it happening.

**f) Finally - the time on your various machines might not be synchronized.**

If one of the machines (DB, scheduler, processor, git) has a significant time drift, various time calculations might come out wrong.

Not very likely - most computing resources out there use NTP or a similar way of syncing time - but we have seen it happen.

Good luck with your investigation and please come back here with the results.


GitHub link: 
https://github.com/apache/airflow/discussions/44495#discussioncomment-11421723
