Hey,

That's an interesting example: you use the KubernetesExecutor along with the PodOperator, so basically a pod is started just to poke the execution of another pod... it looks silly to me; monitoring the pod launched by the PodOperator should be done by the scheduler directly when using the KubernetesExecutor.
Don't you think?

E

On Tue, Apr 16, 2019 at 10:13 PM Kamil Gałuszka <ka...@flyrlabs.com> wrote:

> Hi Daniel,
>
> we will definitely try this out. We are still on 1.10.2, so we will do the
> upgrade and see how it goes.
>
> Thanks
> Kamil
>
> On Tue, Apr 16, 2019 at 9:17 PM Daniel Imberman <dimberman.opensou...@gmail.com> wrote:
>
> > Hi Kamil,
> >
> > So if it's Airflow that's the issue, then the PR I posted should have
> > solved it. Have you upgraded to 1.10.3 and added the
> > worker_pods_creation_batch_size variable to your airflow.cfg? This should
> > allow multiple pods to be launched in parallel.
> >
> > Also, unfortunately the screenshot appears to be broken. Could you please
> > upload it as an imgur link?
> >
> > On Tue, Apr 16, 2019 at 10:50 AM Kamil Gałuszka <ka...@flyrlabs.com> wrote:
> >
> > > Hi Daniel,
> > >
> > > It's Airflow.
> > >
> > > This is a DAG that we can show. Of course, we can change this to
> > > KubernetesPodOperator and it gets even worse.
> > >
> > > ```
> > > from airflow import DAG
> > > from datetime import datetime, timedelta
> > > from airflow.operators.dummy_operator import DummyOperator
> > > from airflow.operators.bash_operator import BashOperator
> > >
> > > default_args = {
> > >     'owner': 'airflow',
> > >     'depends_on_past': False,
> > >     'start_date': datetime(2019, 4, 10, 23),
> > >     'email': ['airf...@example.com'],
> > >     'email_on_failure': False,
> > >     'email_on_retry': False,
> > >     'retries': 5,
> > >     'retry_delay': timedelta(minutes=2)
> > > }
> > >
> > > SLEEP_DURATION = 300
> > >
> > > dag = DAG(
> > >     'wide_bash_test_100_300', default_args=default_args,
> > >     schedule_interval=timedelta(minutes=60), max_active_runs=1,
> > >     concurrency=100)
> > >
> > > start = DummyOperator(task_id='dummy_start_point', dag=dag)
> > > end = DummyOperator(task_id='dummy_end_point', dag=dag)
> > >
> > > for i in range(300):
> > >     sleeper_agent = BashOperator(task_id=f'sleep_{i}',
> > >                                  bash_command=f'sleep {SLEEP_DURATION}',
> > >                                  dag=dag)
> > >     start >> sleeper_agent >> end
> > > ```
> > >
> > > And here is some of what happens afterwards: some tasks fail, some are
> > > retrying, and we definitely don't get 100 concurrency out of it. We have
> > > autoscaling of nodes in GKE, so every pod should move from Pending to
> > > Running after some time. With 300 concurrency this gets a little worse.
> > >
> > > [image: Screen Shot 2019-04-16 at 10.30.18 AM (1).png]
> > >
> > > Thanks
> > > Kamil
> > >
> > > On Tue, Apr 16, 2019 at 6:40 PM Daniel Imberman <dimberman.opensou...@gmail.com> wrote:
> > >
> > > > Hi Kamil,
> > > >
> > > > Could you explain your use case a little further? Is it that your k8s
> > > > cluster runs into issues launching 250 tasks at the same time, or that
> > > > Airflow runs into issues launching 250 tasks at the same time? I'd love
> > > > to know more so I could try to address it in a future Airflow release.
> > > >
> > > > Thanks!
> > > >
> > > > Daniel
> > > >
> > > > On Tue, Apr 16, 2019 at 3:32 AM Kamil Gałuszka <ka...@flyrlabs.com> wrote:
> > > >
> > > > > Hey,
> > > > >
> > > > > > We are quite interested in that Executor too, but my main concern
> > > > > > is: isn't it a waste of resources to start a whole pod to run
> > > > > > something like a DummyOperator, for example?
> > > > > > We have a cap of 200 tasks at any given time and we regularly hit
> > > > > > this cap. We cope with that with 20 Celery workers, but with the
> > > > > > KubernetesExecutor that would mean 200 pods; does it really scale
> > > > > > that easily?
> > > > >
> > > > > Unfortunately no.
> > > > >
> > > > > We currently have the problem that in a DAG with 300 tasks that
> > > > > should all start in parallel at once, only about 140 task instances
> > > > > actually get started. Setting parallelism to 256 didn't help, and the
> > > > > system struggles to get the number of running tasks that high.
> > > > >
> > > > > The biggest problem we have now is finding the bottleneck in the
> > > > > scheduler, but it's taking time to debug it.
> > > > >
> > > > > We will definitely investigate that further and share our findings,
> > > > > but as of now I wouldn't say it's "non-problematic" as some other
> > > > > people stated.
> > > > >
> > > > > Thanks
> > > > > Kamil
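For anyone who wants to try Daniel's suggestion: if I remember correctly, both knobs mentioned in this thread are ordinary airflow.cfg settings, parallelism under [core] and worker_pods_creation_batch_size under [kubernetes] (available as of 1.10.3, per Daniel's note above). A rough sketch, with illustrative values rather than recommendations:

```
[core]
# Upper bound on how many task instances may run at once across the installation.
parallelism = 256

[kubernetes]
# How many worker-pod creation calls the KubernetesExecutor makes per scheduler
# loop; the default (1, I believe) effectively serializes pod creation.
worker_pods_creation_batch_size = 16
```

The same values can also be set through environment variables, e.g. AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE=16.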