Hello, I am attempting to run a workload using the KubernetesExecutor in an AWS EKS cluster. After a certain number of tasks start up, the pods take longer and longer to move from the Pending phase to the Running phase. The issue appears to be related to mounting the volumes that host the dags and logs folders: we start to see "FailedMount" events as the number of tasks increases.
The dags and logs folders are mounted using PersistentVolumes and PersistentVolumeClaims, hosted on AWS EFS drives. I have set up the PersistentVolumes in two ways, both with the same results: (1) using the EFS CSI driver, and (2) using a hostPath, with the drives mounted on the underlying EC2 instances. (A rough sketch of the PV/PVC definitions is pasted below my signature.) As the workload scales up, the percentage of pods in the Pending phase (vs. the Running phase) continues to grow. Eventually, pods spawned by the KubernetesPodOperator start to fail because they remain in the Pending phase for too long. I have worked with AWS support, and they don't believe the issue is related to the EFS drives; from the evidence I can see, I tend to agree. Has anyone seen anything similar to this? Has anybody been able to successfully scale up Airflow on a K8S cluster?

Thanks,

Jim Majure | Principal Machine Learning Engineer
aurishealth.com | 150 Shoreline Dr. | Redwood City, CA 94065
(515) 829-0667
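
For reference, here is approximately what the CSI-based PersistentVolume and PersistentVolumeClaim for the dags folder look like (the names, storage class, and EFS filesystem ID below are placeholders, not my actual values):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: airflow-dags
spec:
  capacity:
    storage: 5Gi              # required by the API; EFS itself is elastic
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-XXXXXXXX  # placeholder EFS filesystem ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-dags
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi

The logs volume is the same shape, just with ReadWriteMany so the worker pods can write to it.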
