Hi Daniel,

It's Airflow, not the k8s cluster.

This is a DAG that we can share. Of course, we can change this to
KubernetesPodOperator and it gets even worse; a sketch of that variant
follows the DAG below.

```
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 10, 23),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=2)
}

SLEEP_DURATION = 300

dag = DAG(
    'wide_bash_test_100_300',
    default_args=default_args,
    schedule_interval=timedelta(minutes=60),
    max_active_runs=1,
    concurrency=100)

start = DummyOperator(task_id='dummy_start_point', dag=dag)
end = DummyOperator(task_id='dummy_end_point', dag=dag)

# Fan out 300 identical sleep tasks between the start and end markers.
for i in range(300):
    sleeper_agent = BashOperator(task_id=f'sleep_{i}',
                                 bash_command=f'sleep {SLEEP_DURATION}',
                                 dag=dag)

    start >> sleeper_agent >> end
```
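For reference, here is a minimal sketch of the same fan-out using
KubernetesPodOperator, where every one of the 300 tasks becomes its own pod
(Airflow 1.10 contrib import path; the namespace, image, and pod naming are
illustrative assumptions, not our exact setup):

```
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

for i in range(300):
    # Each task instance launches a dedicated pod that just sleeps.
    sleeper_pod = KubernetesPodOperator(
        task_id=f'sleep_pod_{i}',
        name=f'sleep-pod-{i}',        # pod names must be DNS-1123 compliant
        namespace='default',          # assumption: default namespace
        image='ubuntu:18.04',         # assumption: any image with /bin/sleep
        cmds=['sleep', str(SLEEP_DURATION)],
        in_cluster=True,
        get_logs=True,
        is_delete_operator_pod=True,  # clean finished pods out of the cluster
        dag=dag)

    start >> sleeper_pod >> end
```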

And here is what happens afterwards: some tasks fail, some are retrying, and
we definitely don't get 100 tasks running concurrently. We have node
autoscaling in GKE, so after some time every pod should move from Pending to
Running. With concurrency set to 300 this gets a little worse.
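For context, a sketch of the Airflow-side settings that cap how many of these
tasks can run at once (airflow.cfg names as in Airflow 1.10; the values shown
are illustrative except where mentioned in this thread):

```
[core]
# Global cap on running task instances across the whole installation;
# we raised this to 256 without the running-task count reaching it.
parallelism = 256

# Default per-DAG cap (1.10 default shown); the DAG above overrides it
# with concurrency=100.
dag_concurrency = 16

[scheduler]
# Number of scheduler worker threads (1.10 default shown).
max_threads = 2
```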

[image: Screen Shot 2019-04-16 at 10.30.18 AM (1).png]

Thanks
Kamil

On Tue, Apr 16, 2019 at 6:40 PM Daniel Imberman <[email protected]> wrote:

> Hi Kamil,
>
> Could you explain your use-case a little further? Is it that your k8s
> cluster runs into issues launching 250 tasks at the same time or that
> airflow runs into issues launching 250 tasks at the same time? I'd love to
> know more so I could try to address it in a future airflow release.
>
> Thanks!
>
> Daniel
>
> On Tue, Apr 16, 2019 at 3:32 AM Kamil Gałuszka <[email protected]> wrote:
>
> > Hey,
> >
> > > We are quite interested in that Executor too, but my main concern:
> > > isn't it a waste of resources to start a whole pod to run something
> > > like DummyOperator, for example? We have a cap of 200 tasks at any
> > > given time and we regularly hit this cap. We cope with that with 20
> > > Celery workers, but with the KubernetesExecutor that would mean 200
> > > pods. Does it really scale that easily?
> > >
> >
> > Unfortunately no.
> >
> > We now have a problem with a DAG of 300 tasks that should all start in
> > parallel at once, but only about 140 task instances actually get started.
> > Setting parallelism to 256 didn't help, and the system struggles to get
> > the number of running tasks that high.
> >
> > The biggest problem we have now is finding the bottleneck in the
> > scheduler, but debugging it is taking time.
> >
> > We will definitely investigate this further and share our findings, but
> > for now I wouldn't call it "non-problematic" as some other people have
> > stated.
> >
> > Thanks
> > Kamil
> >
>
