seifrajhi opened a new issue, #44652:
URL: https://github.com/apache/airflow/issues/44652
### Apache Airflow version
2.10.3
### If "Other Airflow 2 version" selected, which one?
_No response_
### What happened?
**Description:**
I have Airflow 2.10.3 deployed on AKS using the Helm chart, and everything
works fine. I then tried to run the DAG processor as a standalone process
instead of as a subprocess of the scheduler. Here is my configuration:
```yaml
dagProcessor:
  enabled: true
  replicas: 2
  revisionHistoryLimit: 5
  resources:
    requests:
      cpu: 2500m
      ephemeral-storage: 200Mi
      memory: 2500Mi
    limits:
      ephemeral-storage: 200Mi
      memory: 2500Mi
  podAnnotations: *podAnnotations
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "airflow.workload"
                operator: In
                values:
                  - dagprocessor
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                component: dagprocessor
            topologyKey: kubernetes.io/hostname
          weight: 100
  tolerations:
    - key: "airflow.workload"
      value: "dagprocessor"
      operator: "Equal"
      effect: "NoSchedule"
[core]
standalone_dag_processor: "True"
```
I managed to run the DAG processor pods on their own node pool, separate from
the workers. However, DAGs now appear and disappear frequently when the DAG
bag is large. In my QA environment, with only 500 DAGs, I don't see this
issue, but in production, with more than 2000 DAGs, it happens frequently.
In the Cluster Activity view, the DAG processor state turns red (unhealthy)
for a few seconds and then becomes healthy again, in an endless cycle.
The error in the logs is:
```
sqlalchemy.exc.PendingRollbackError: This Session's transaction has been
rolled back due to a previous exception during flush. To begin a new
transaction with this Session, first issue Session.rollback(). Original
exception was: (psycopg2.errors.UniqueViolation) duplicate key value violates
unique constraint "serialized_dag_pkey"
DETAIL: Key (dag_id)=(demo-dag) already exists.
(Background on this error at: https://sqlalche.me/e/14/gkpj) (Background on
this error at: https://sqlalche.me/e/14/7s2a)
```
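For illustration only, here is a minimal sketch of one way a duplicate-key
error like this can arise: two writers each check whether a row exists, both
see none, and both try to insert. Whether this is what actually happens between
the two DAG processor replicas is an assumption and is not confirmed by the
logs above. The sketch uses Python's built-in `sqlite3` as a stand-in for
PostgreSQL, and the table layout and helper names (`has_row`,
`write_serialized_dag`) are hypothetical, not Airflow's own code.

```python
import sqlite3

# Hypothetical stand-in for the serialized_dag table: dag_id is the primary key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT)")


def has_row(conn, dag_id):
    """Return True if a row for dag_id is already stored."""
    cur = conn.execute("SELECT 1 FROM serialized_dag WHERE dag_id = ?", (dag_id,))
    return cur.fetchone() is not None


def write_serialized_dag(conn, dag_id, data, existed):
    """Update or insert based on an earlier existence check (check-then-write)."""
    if existed:
        conn.execute(
            "UPDATE serialized_dag SET data = ? WHERE dag_id = ?", (data, dag_id)
        )
    else:
        conn.execute(
            "INSERT INTO serialized_dag (dag_id, data) VALUES (?, ?)", (dag_id, data)
        )


# Two hypothetical processors both check before either one has written.
seen_by_a = has_row(conn, "demo-dag")
seen_by_b = has_row(conn, "demo-dag")

# Processor A inserts first and succeeds.
write_serialized_dag(conn, "demo-dag", "from processor A", seen_by_a)

# Processor B acts on its stale check and hits the primary-key constraint.
try:
    write_serialized_dag(conn, "demo-dag", "from processor B", seen_by_b)
    outcome = "ok"
except sqlite3.IntegrityError as exc:
    outcome = f"duplicate key: {exc}"

print(outcome)
```

In the sketch, the second write fails with an integrity error because its
existence check was already stale by the time it inserted.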
Any help or guidance on resolving this issue would be greatly appreciated.
### What you think should happen instead?
The DAG processor should handle a large number of DAGs without causing them
to appear and disappear frequently. The state of the DAG processor should
remain stable and not fluctuate between healthy and unhealthy.
### How to reproduce
- Run the DAG processor as a subprocess of a scheduler job.
- Migrate the DAG processor to run as a standalone process deployment.
- Deploy in an environment with a large number of DAGs (e.g., more than 2000
DAGs).
### Operating System
AKS 1.29, AzureLinux
### Versions of Apache Airflow Providers
_No response_
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
_No response_
### Anything else?
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)