seifrajhi opened a new issue, #44652:
URL: https://github.com/apache/airflow/issues/44652

   ### Apache Airflow version
   
   2.10.3
   
   ### If "Other Airflow 2 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   **Description:**
   
   I have Airflow 2.10.3 deployed in AKS using the Helm chart, and everything works fine. I then enabled the DAG processor to run as a standalone process instead of inside the scheduler. Here is my configuration:
   
   ```yaml
   dagProcessor:
     enabled: true
     replicas: 2
     revisionHistoryLimit: 5
     resources:
       requests:
         cpu: 2500m
         ephemeral-storage: 200Mi
         memory: 2500Mi
       limits:
         ephemeral-storage: 200Mi
         memory: 2500Mi
     podAnnotations: *podAnnotations
     affinity:
       nodeAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           nodeSelectorTerms:
           - matchExpressions:
             - key: "airflow.workload"
               operator: In
               values:
               - dagprocessor
       podAntiAffinity:
         preferredDuringSchedulingIgnoredDuringExecution:
         - podAffinityTerm:
             labelSelector:
               matchLabels:
                 component: dagprocessor
             topologyKey: kubernetes.io/hostname
           weight: 100
     tolerations:
       - key: "airflow.workload"
         value: "dagprocessor"
         operator: "Equal"
         effect: "NoSchedule"
   ```
   
   and in the Airflow configuration:
   
   ```ini
   [core]
   standalone_dag_processor = True
   ```
   
   This successfully pinned the DAG processor pods to their own node pool, separate from the workers. However, I then started seeing DAGs appear and disappear frequently when the DAG bag is large. In my QA environment with only 500 DAGs the issue does not occur, but in production with more than 2000 DAGs it happens frequently.
   
   In the cluster activity view, I see the DAG processor state turn red (unhealthy) for a few seconds and then healthy again, in an endless cycle. The error in the logs is:
   
   ```
   sqlalchemy.exc.PendingRollbackError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "serialized_dag_pkey"
   DETAIL:  Key (dag_id)=(demo-dag) already exists.
   (Background on this error at: https://sqlalche.me/e/14/gkpj) (Background on this error at: https://sqlalche.me/e/14/7s2a)
   ```
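   My suspicion (an assumption on my part, not confirmed): with `replicas: 2`, both DAG processor instances parse the same files and race to write the same `serialized_dag` row. A minimal stdlib sketch of that failure class, standing in for two replicas sharing one metadata DB (the table name mirrors Airflow's, but the schema here is illustrative only, not Airflow code):

```python
import sqlite3

# Two connections to one shared in-memory DB stand in for two
# dag-processor replicas writing to the same metadata database.
db_uri = "file:metadata?mode=memory&cache=shared"
replica_a = sqlite3.connect(db_uri, uri=True)
replica_b = sqlite3.connect(db_uri, uri=True)

replica_a.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY)")
replica_a.execute("INSERT INTO serialized_dag VALUES ('demo-dag')")
replica_a.commit()

try:
    # The second replica parses the same DAG file and inserts the same key.
    replica_b.execute("INSERT INTO serialized_dag VALUES ('demo-dag')")
    replica_b.commit()
    hit_unique_violation = False
except sqlite3.IntegrityError:
    # Same failure class as psycopg2.errors.UniqueViolation on
    # "serialized_dag_pkey". In SQLAlchemy, the session then refuses
    # further work (the PendingRollbackError above) until rollback.
    hit_unique_violation = True
    replica_b.rollback()
```

   If that theory holds, dropping to `replicas: 1` should make the errors disappear, which would narrow the bug down to concurrent serialized-DAG writes.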
   
   Any help or guidance on resolving this issue would be greatly appreciated.
   
   ### What you think should happen instead?
   
   The DAG processor should handle a large number of DAGs without causing them 
to appear and disappear frequently. The state of the DAG processor should 
remain stable and not fluctuate between healthy and unhealthy.
   
   ### How to reproduce
   
   
   - Run the DAG processor as a subprocess of the scheduler (the default setup).
   - Migrate the DAG processor to run as a standalone process deployment.
   - Deploy in an environment with a large number of DAGs (e.g., more than 2000 
DAGs).
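   To reproduce at scale, a throwaway generator like this can fill the DAGs folder with trivial DAGs (the file names and DAG template are illustrative, not from my actual deployment):

```python
from pathlib import Path
import tempfile

# Each generated file defines one trivial DAG; the import lines in the
# template are only executed later, when Airflow parses the files.
TEMPLATE = """\
from airflow import DAG
from airflow.operators.empty import EmptyOperator
import pendulum

with DAG(
    dag_id="demo-dag-{i}",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
) as dag:
    EmptyOperator(task_id="noop")
"""

def generate_dags(dags_folder: str, count: int = 2000) -> int:
    """Write `count` single-task DAG files into `dags_folder`."""
    folder = Path(dags_folder)
    folder.mkdir(parents=True, exist_ok=True)
    for i in range(count):
        (folder / f"demo_dag_{i}.py").write_text(TEMPLATE.format(i=i))
    return count

dags_dir = tempfile.mkdtemp()          # point this at the real DAGs folder
generated = generate_dags(dags_dir, count=50)  # use count=2000 to reproduce
```

   Pointing the generator at the chart's mounted DAGs folder with `count=2000` reproduces the DAG bag size where I see the flapping.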
   
   ### Operating System
   
   AKS 1.29, AzureLinux
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to