The reason we didn't do it this way is that we would have needed to copy all of the logic in the k8sPodOperator into the executor. This would have added a lot more code and complexity to the executor.
While it does look silly from an external POV, it makes sense from a separation-of-concerns POV. The executor shouldn't care about what it's executing. It passes the task off to a worker, and the worker does the task it's been given.

On Wed, Apr 17, 2019 at 2:50 AM Emmanuel Brard <emmanuel.br...@getyourguide.com> wrote:

> Hey,
>
> That's an interesting example: you use the KubernetesExecutor along with
> the PodOperator, so basically a pod will be started and will poke the
> execution of another pod... it looks silly to me; monitoring the pod from
> the PodOperator should be done by the scheduler directly when using the
> KubernetesExecutor.
>
> Don't you think?
>
> E
>
> On Tue, Apr 16, 2019 at 10:13 PM Kamil Gałuszka <ka...@flyrlabs.com> wrote:
>
> > Hi Daniel,
> >
> > we will definitely try this out. We are still on 1.10.2, so we will do
> > the upgrade and see how it goes.
> >
> > Thanks
> > Kamil
> >
> > On Tue, Apr 16, 2019 at 9:17 PM Daniel Imberman <dimberman.opensou...@gmail.com> wrote:
> >
> > > Hi Kamil,
> > >
> > > So if it's airflow that's the issue, then the PR I posted should have
> > > solved it. Have you upgraded to 1.10.3 and added the
> > > worker_pods_creation_batch_size variable to your airflow.cfg? This
> > > should allow multiple pods to be launched in parallel.
> > >
> > > Also, unfortunately the screenshot appears to be broken. Could you
> > > please upload it as an imgur link?
> > >
> > > On Tue, Apr 16, 2019 at 10:50 AM Kamil Gałuszka <ka...@flyrlabs.com> wrote:
> > >
> > > > Hi Daniel,
> > > >
> > > > It's airflow.
> > > >
> > > > This is a DAG that we could show. Of course, we can change this to
> > > > KubernetesPodOperator, and this gets even worse.
> > > >
> > > > ```
> > > > from airflow import DAG
> > > > from datetime import datetime, timedelta
> > > > from airflow.operators.dummy_operator import DummyOperator
> > > > from airflow.operators.bash_operator import BashOperator
> > > >
> > > > default_args = {
> > > >     'owner': 'airflow',
> > > >     'depends_on_past': False,
> > > >     'start_date': datetime(2019, 4, 10, 23),
> > > >     'email': ['airf...@example.com'],
> > > >     'email_on_failure': False,
> > > >     'email_on_retry': False,
> > > >     'retries': 5,
> > > >     'retry_delay': timedelta(minutes=2)
> > > > }
> > > >
> > > > SLEEP_DURATION = 300
> > > >
> > > > dag = DAG(
> > > >     'wide_bash_test_100_300', default_args=default_args,
> > > >     schedule_interval=timedelta(minutes=60), max_active_runs=1,
> > > >     concurrency=100)
> > > >
> > > > start = DummyOperator(task_id='dummy_start_point', dag=dag)
> > > > end = DummyOperator(task_id='dummy_end_point', dag=dag)
> > > >
> > > > for i in range(300):
> > > >     sleeper_agent = BashOperator(task_id=f'sleep_{i}',
> > > >                                  bash_command=f'sleep {SLEEP_DURATION}',
> > > >                                  dag=dag)
> > > >     start >> sleeper_agent >> end
> > > > ```
> > > >
> > > > And here is some of what happens afterwards: some tasks failed, some
> > > > are retrying, and we definitely don't get 100 concurrency on it. We
> > > > have autoscaling of nodes in GKE, so after some time every pod should
> > > > move from Pending to Running. With 300 concurrency this gets a little
> > > > worse.
> > > >
> > > > [image: Screen Shot 2019-04-16 at 10.30.18 AM (1).png]
> > > >
> > > > Thanks
> > > > Kamil
> > > >
> > > > On Tue, Apr 16, 2019 at 6:40 PM Daniel Imberman <dimberman.opensou...@gmail.com> wrote:
> > > >
> > > > > Hi Kamil,
> > > > >
> > > > > Could you explain your use case a little further? Is it that your
> > > > > k8s cluster runs into issues launching 250 tasks at the same time,
> > > > > or that airflow runs into issues launching 250 tasks at the same
> > > > > time?
> > > > > I'd love to know more so I could try to address it in a future
> > > > > airflow release.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Daniel
> > > > >
> > > > > On Tue, Apr 16, 2019 at 3:32 AM Kamil Gałuszka <ka...@flyrlabs.com> wrote:
> > > > >
> > > > > > Hey,
> > > > > >
> > > > > > > We are quite interested in that Executor too, but my main
> > > > > > > concern is: isn't it a waste of resources to start a whole pod
> > > > > > > to run a thing like DummyOperator, for example? We have a cap
> > > > > > > of 200 tasks at any given time and we regularly hit this cap;
> > > > > > > we cope with that with 20 Celery workers, but with the
> > > > > > > KubernetesExecutor that would mean 200 pods. Does it really
> > > > > > > scale that easily?
> > > > > >
> > > > > > Unfortunately no.
> > > > > >
> > > > > > We now have the problem of a DAG with 300 tasks that should all
> > > > > > start in parallel at once, and only about 140 task instances get
> > > > > > started. Setting parallelism to 256 didn't help, and the system
> > > > > > struggles to get the number of running tasks that high.
> > > > > >
> > > > > > The biggest problem we have now is finding the bottleneck in the
> > > > > > scheduler, but it's taking time to debug it.
> > > > > >
> > > > > > We will definitely investigate this further and share our
> > > > > > findings, but as of now I wouldn't say it's "non-problematic", as
> > > > > > some other people have stated.
> > > > > >
> > > > > > Thanks
> > > > > > Kamil
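For anyone following along: the knobs discussed in this thread live in airflow.cfg. Below is a minimal sketch of where they go; the numeric values are purely illustrative examples, not tuning recommendations, and `worker_pods_creation_batch_size` only exists as of Airflow 1.10.3.

```
# airflow.cfg (sketch only; values are illustrative)

[core]
# Global cap on concurrently running task instances across all DAGs.
parallelism = 256
# Default per-DAG task concurrency (the example DAG above overrides
# this explicitly with concurrency=100).
dag_concurrency = 100

[kubernetes]
# Number of worker pods the KubernetesExecutor may create per
# scheduler loop (new in 1.10.3); raising it lets pods launch in
# parallel rather than one at a time.
worker_pods_creation_batch_size = 16
```

Note the interaction: even with a large `worker_pods_creation_batch_size`, the number of simultaneously running tasks is still bounded by `parallelism` and the DAG-level `concurrency` setting.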