Hi Daniel,

We will definitely try this out. We are still on 1.10.2, so we will do the
upgrade and see how it goes.
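For reference, this is roughly the airflow.cfg change we plan to make after the
upgrade; the values below are just our first guesses, not recommended settings:

```
# airflow.cfg (Airflow 1.10.3+) -- values are our guesses, not recommendations
[core]
# upper bound on task instances running across the whole installation
parallelism = 256

[kubernetes]
# number of worker pod creation calls the KubernetesExecutor makes per scheduler loop
worker_pods_creation_batch_size = 16
```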
Thanks
Kamil

On Tue, Apr 16, 2019 at 9:17 PM Daniel Imberman <[email protected]> wrote:

> Hi Kamil,
>
> So if it's airflow that's the issue then the PR I posted should have
> solved it. Have you upgraded to 1.10.3 and added the
> worker_pods_creation_batch_size variable to your airflow.cfg? This should
> allow multiple pods to be launched in parallel.
>
> Also, unfortunately, the screenshot appears to be broken. Could you please
> upload it as an imgur link?
>
> On Tue, Apr 16, 2019 at 10:50 AM Kamil Gałuszka <[email protected]> wrote:
>
>> Hi Daniel,
>>
>> It's airflow.
>>
>> This is a DAG that we can share. Of course, we can change this to
>> KubernetesPodOperator, and then it gets even worse.
>>
>> ```
>> from airflow import DAG
>> from datetime import datetime, timedelta
>> from airflow.operators.dummy_operator import DummyOperator
>> from airflow.operators.bash_operator import BashOperator
>>
>> default_args = {
>>     'owner': 'airflow',
>>     'depends_on_past': False,
>>     'start_date': datetime(2019, 4, 10, 23),
>>     'email': ['[email protected]'],
>>     'email_on_failure': False,
>>     'email_on_retry': False,
>>     'retries': 5,
>>     'retry_delay': timedelta(minutes=2)
>> }
>>
>> SLEEP_DURATION = 300
>>
>> dag = DAG(
>>     'wide_bash_test_100_300', default_args=default_args,
>>     schedule_interval=timedelta(minutes=60), max_active_runs=1,
>>     concurrency=100)
>>
>> start = DummyOperator(task_id='dummy_start_point', dag=dag)
>> end = DummyOperator(task_id='dummy_end_point', dag=dag)
>>
>> for i in range(300):
>>     sleeper_agent = BashOperator(task_id=f'sleep_{i}',
>>                                  bash_command=f'sleep {SLEEP_DURATION}',
>>                                  dag=dag)
>>     start >> sleeper_agent >> end
>> ```
>>
>> And here is what happens afterwards: some tasks fail, some are retrying,
>> and we definitely don't get 100 concurrency out of it. We have node
>> autoscaling in GKE, so every pod should eventually move from Pending to
>> Running. With 300 concurrency this gets a little worse.
>>
>> [image: Screen Shot 2019-04-16 at 10.30.18 AM (1).png]
>>
>> Thanks
>> Kamil
>>
>> On Tue, Apr 16, 2019 at 6:40 PM Daniel Imberman <[email protected]> wrote:
>>
>>> Hi Kamil,
>>>
>>> Could you explain your use-case a little further? Is it that your k8s
>>> cluster runs into issues launching 250 tasks at the same time, or that
>>> airflow runs into issues launching 250 tasks at the same time? I'd love
>>> to know more so I could try to address it in a future airflow release.
>>>
>>> Thanks!
>>>
>>> Daniel
>>>
>>> On Tue, Apr 16, 2019 at 3:32 AM Kamil Gałuszka <[email protected]> wrote:
>>>
>>>> Hey,
>>>>
>>>>> We are quite interested in that Executor too, but my main concern is:
>>>>> isn't it a waste of resources to start a whole pod to run something
>>>>> like a DummyOperator, for example? We have a cap of 200 tasks at any
>>>>> given time and we regularly hit it; we cope with that using 20 celery
>>>>> workers, but with the KubernetesExecutor that would mean 200 pods.
>>>>> Does it really scale that easily?
>>>>
>>>> Unfortunately, no.
>>>>
>>>> Our problem right now is a DAG with 300 tasks that should all start in
>>>> parallel at once, yet only about 140 task instances get started.
>>>> Setting parallelism to 256 didn't help, and the system struggles to
>>>> get the number of running tasks that high.
>>>>
>>>> The biggest problem we have now is finding the bottleneck in the
>>>> scheduler, but it's taking time to debug.
>>>>
>>>> We will definitely investigate further and share our findings, but for
>>>> now I wouldn't say it's "non-problematic", as some other people have
>>>> stated.
>>>>
>>>> Thanks
>>>> Kamil
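PS: since the KubernetesPodOperator variant comes up in the thread above, here
is a rough sketch of what that version of the same test DAG looks like. Treat
it as an illustration only: the image, namespace, and DAG id are placeholders,
not exactly what we ran.

```
# Sketch only: same wide fan-out as the BashOperator DAG above, but every
# sleep task runs in its own pod via KubernetesPodOperator (Airflow 1.10.x
# contrib import path). Image and namespace are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 4, 10, 23),
    'retries': 5,
    'retry_delay': timedelta(minutes=2),
}

SLEEP_DURATION = 300

dag = DAG(
    'wide_k8s_pod_test_100_300', default_args=default_args,
    schedule_interval=timedelta(minutes=60), max_active_runs=1,
    concurrency=100)

start = DummyOperator(task_id='dummy_start_point', dag=dag)
end = DummyOperator(task_id='dummy_end_point', dag=dag)

for i in range(300):
    sleeper_agent = KubernetesPodOperator(
        task_id=f'sleep_{i}',
        name=f'sleep-{i}',                    # pod names must be DNS-safe, so dashes
        namespace='default',                  # placeholder namespace
        image='ubuntu:18.04',                 # placeholder image
        cmds=['bash', '-c'],
        arguments=[f'sleep {SLEEP_DURATION}'],
        is_delete_operator_pod=True,          # clean up pods after completion
        dag=dag)
    start >> sleeper_agent >> end
```

Note that under the KubernetesExecutor each of these tasks needs an executor
worker pod in addition to the pod the operator itself launches, so the pod
count roughly doubles, which is presumably why this variant behaves even worse
for us.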
