GitHub user rtrindvg edited a discussion: Need help understanding total number of DAGs oscillating on UI

I am in the middle of a migration from Airflow running in a virtual machine to a Kubernetes cluster, for now in a staging environment. After a lot of configuration adjustments in the Helm values.yaml, the cluster seems to be stable and working fine.

But for some reason, the UI sometimes shows fewer DAGs than are actually available. For example, we have a total of 93 DAGs. After the initial load, which takes a couple of minutes, the count stays stable for some time. Then it drops to a smaller number (like 64) and, after a couple of minutes, it starts to climb back up, eventually returning to 93. We confirmed this is not any kind of browser cache. There were no pod restarts in the meantime, no changes to the cluster, and no DAGs were changed either.

We are using git-sync with non-persistent storage, as recommended in the docs. We enabled its debug logs and it seems to be working fine: it only downloads changes when the DAGs branch changes, and those changes seem to propagate quickly to all relevant pods.
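For reference, this is roughly how our DAG syncing is set up in values.yaml. The repo and branch are placeholders, and I am writing the keys from memory of the official chart, so the exact names may differ slightly from our real file:

```yaml
# Sketch of the relevant part of our Helm values.yaml (placeholder repo/branch;
# key names from memory of the official apache-airflow chart).
dags:
  persistence:
    enabled: false          # non-persistent storage, per the docs
  gitSync:
    enabled: true
    repo: https://example.com/our-org/our-dags.git   # placeholder URL
    branch: main                                     # placeholder branch
    subPath: dags
    period: 60s             # how often git-sync polls the repo for changes
```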

The scheduler logs did not show any errors that could explain the drop in the total number of DAGs, except for the following line:
```
[2024-11-30T02:10:03.298+0000] {scheduler_job_runner.py:1782} INFO - Found (8) stales dags not parsed after 2024-11-30 02:00:03.296796+00:00.
```
I am still researching whether this is relevant to the issue at hand, but without success so far.
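If I read the config reference correctly, that ten-minute cutoff looks like it could come from the `[scheduler] dag_stale_not_seen_duration` option (default 600 seconds), which as far as I can tell only applies when the standalone DAG processor is used. I am not sure this is the right knob, but one experiment we were considering is raising it through the chart's env section, roughly like this:

```yaml
# Experiment we were considering (assumption: dag_stale_not_seen_duration is the
# setting behind the "stales dags" message; default appears to be 600 seconds).
env:
  - name: AIRFLOW__SCHEDULER__DAG_STALE_NOT_SEEN_DURATION
    value: "1800"   # 30 minutes instead of the default 10
```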

Another fix we tried was enabling the non-default standalone DAG processor, but the behavior is the same. I also tried turning on the processor's verbose mode via an environment variable, without success. The logs are mostly blank, so I have no clue whether the DAG processor is the culprit.
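In case it matters, this is roughly what we set. The logging env var is what I meant by "verbose mode"; I am not certain it is the right way to get more output from the processor:

```yaml
# Rough sketch of what we enabled (assumption: AIRFLOW__LOGGING__LOGGING_LEVEL
# is the right way to get more verbose output from the DAG processor).
dagProcessor:
  enabled: true
env:
  - name: AIRFLOW__LOGGING__LOGGING_LEVEL
    value: "DEBUG"
```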

We also replaced the CeleryExecutor with the KubernetesExecutor, because it is better suited to our purposes. We did not think it had any relation to the issue and, as expected, the behavior persists.
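For completeness, the executor swap was just this in values.yaml (assuming the top-level executor key is all that is needed; we did not change anything else for it):

```yaml
# The executor change, as we applied it in values.yaml.
executor: "KubernetesExecutor"
```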

Since I am from the cloud-infra team and have no previous experience with Airflow, can someone help me understand what the issue could be and suggest possible next steps for diagnosing our environment?

We are using Airflow 2.9.3 (since it is the most recent version in the latest Helm chart available), with Python 3.12, in a custom Dockerfile. We are not extending the image, we are really customizing it, since we need to perform a couple of compilations and it was more efficient to do them before the Airflow pip installs, to make rebuilds faster and the final image smaller. I did not know whether it was safe to point the image to the latest Airflow available (since I assume an updated Helm chart would have been published if it were), so we kept using this one.
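In the chart, we point to our custom image roughly like this (registry and tag are placeholders):

```yaml
# How we point the chart at our custom image (placeholder registry/tag).
images:
  airflow:
    repository: registry.example.com/our-team/airflow-custom
    tag: "2.9.3-build42"
    pullPolicy: IfNotPresent
```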

Embedding the DAGs into the image is not an option, since they change constantly, and rebuilding the image and redeploying the cluster several times a day is not ideal for us. If upgrading the cluster to 2.10.3 is safe, and there are known issues regarding this behavior that it fixes, please point me in the right direction.

Thanks for any tips!

GitHub link: https://github.com/apache/airflow/discussions/44495
