Hi Tao,

We have an internal tool called Optimus that we use for ETL
transformations. Optimus takes in BigQuery transformation requests,
generates DAGs, and leverages ExternalTaskSensor to manage the
dependencies between DAGs. We use KubernetesPodOperator in these DAGs,
as each transformation has to run as an image.
We were earlier using Google Cloud Composer to host this tool and
experienced issues around resource starvation and DAG scheduling.
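
For context, a DAG generated by Optimus looks roughly like the sketch
below (the DAG ids, namespace, and image are placeholders, not our
actual configuration):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

with DAG(
    dag_id="optimus_orders_daily",                  # hypothetical generated DAG id
    schedule_interval="@daily",
    start_date=datetime(2019, 12, 1),
    catchup=False,
) as dag:
    # Wait for the upstream table's DAG to finish its transformation task
    # for the same execution date.
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_upstream_table",
        external_dag_id="optimus_customers_daily",  # hypothetical upstream DAG
        external_task_id="transform",
    )

    # Run the transformation image as a Kubernetes pod.
    transform = KubernetesPodOperator(
        task_id="transform",
        name="optimus-transform",
        namespace="airflow",
        image="example.registry/optimus-transform:latest",  # placeholder image
        get_logs=True,
    )

    wait_for_upstream >> transform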

Thus we migrated to self-hosted Airflow on Kubernetes. At that time we
were already running Airflow on Kubernetes with KubernetesExecutor for
our batch transformation pipelines, using an internal Helm chart we
maintain for our custom use cases. We didn't add CeleryExecutor support
to that chart back then, because KubernetesExecutor covered all of our
use cases.

Since our Helm chart didn't support CeleryExecutor, we tried using
LocalExecutor for Optimus.

We will add CeleryExecutor support to our Helm chart and start using it.
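
Roughly, the switch should only require exposing the Celery settings in
the chart and running the separate "airflow worker" processes that
CeleryExecutor needs; something like the following configuration (the
broker and result backend values here are illustrative, the real ones
depend on our setup):

    AIRFLOW__CORE__EXECUTOR=CeleryExecutor
    AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0
    AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres/airflow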

Hope that answers your question.

Thanks and regards,
Maulik

On Fri, Dec 13, 2019 at 1:00 PM Tao Feng <fengta...@gmail.com> wrote:

> Curious to know why you want to run KubernetesPodOperator when you already
> have the k8s executor in prod? Does it solve a different use case that
> cannot be solved by the k8s executor?
>
> On Thu, Dec 12, 2019 at 11:13 PM Maulik Soneji <maulik.son...@gojek.com>
> wrote:
>
>> Hi Maxime,
>>
>> We have been using KubernetesExecutor for our ETL use cases.
>>
>> This is a separate instance of Airflow where we have to run an image using
>> KubernetesPodOperator. Using KubernetesExecutor together with
>> KubernetesPodOperator would create too many pods (each task would need an
>> executor worker pod in addition to the pod launched by the operator), and
>> we experienced scheduling delays while running tasks with this approach.
>> That is why we shifted to using LocalExecutor along with
>> KubernetesPodOperator.
>>
>> I realize that LocalExecutor is not the right choice for production use
>> cases; we will move to CeleryExecutor.
>>
>> Thanks and regards,
>> Maulik
>>
>> On Fri, Dec 13, 2019 at 12:10 PM Maxime Beauchemin <
>> maximebeauche...@gmail.com> wrote:
>>
>> > Friends don't let friends use LocalExecutor in production.
>> >
>> > LocalExecutor is essentially a subprocess pool running in-process. When I
>> > wrote it originally, I never thought it would ever be used in production.
>> > Celery / CeleryExecutor is more reasonable, as Celery is a proper
>> > process/thread pool that's configurable and does what you want it to do.
>> > It's clearly lacking orchestration features like containerization and
>> > resource containment, but it's a decent solution if you don't have a k8s
>> > cluster lying around.
>> >
>> > Max
>> >
>> > On Thu, Dec 12, 2019 at 10:27 PM Maulik Soneji <maulik.son...@gojek.com>
>> > wrote:
>> >
>> >> Hello,
>> >>
>> >> I realize that the scheduler is waiting for the tasks to be completed
>> >> before shutting down.
>> >>
>> >> The problem is that the scheduler stops sending heartbeats and just waits
>> >> for the task queue to be joined.
>> >>
>> >> Is there a way to horizontally scale the number of scheduler instances,
>> >> so that if one scheduler instance is waiting for the task queue to
>> >> complete, another instance keeps running jobs and sending heartbeat
>> >> messages?
>> >>
>> >> Please share some context on how we can better scale the scheduler to
>> >> handle failure cases.
>> >>
>> >> Thanks and regards,
>> >> Maulik
>> >>
>> >> On Fri, Dec 13, 2019 at 9:18 AM Maulik Soneji <maulik.son...@gojek.com>
>> >> wrote:
>> >>
>> >> > Hello all,
>> >> >
>> >> > *TLDR*: We are using LocalExecutor with KubernetesPodOperator for our
>> >> > Airflow DAGs. From the scheduler's stack trace we see that it is waiting
>> >> > on the queue to join.
>> >> >
>> >> > File: "/usr/local/lib/python3.7/site-packages/airflow/executors/local_executor.py", line 212, in end
>> >> >     self.queue.join()
>> >> >
>> >> > Airflow version: 1.10.6
>> >> >
>> >> > Use case: We are using Airflow for an ETL pipeline that transforms data
>> >> > from one BigQuery table to another.
>> >> >
>> >> > We have noticed that the scheduler frequently hangs without any
>> >> > detailed logs.
>> >> >
>> >> > Last lines are below:
>> >> >
>> >> > [2019-12-12 13:28:53,370] {settings.py:277} DEBUG - Disposing DB connection pool (PID 912099)
>> >> > [2019-12-12 13:28:53,860] {dag_processing.py:693} INFO - Sending termination message to manager.
>> >> > [2019-12-12 13:28:53,860] {scheduler_job.py:1492} INFO - Deactivating DAGs that haven't been touched since 2019-12-12T01:28:52.682686+00:00
>> >> >
>> >> > On checking the thread dumps of the running process, we found that the
>> >> > scheduler is stuck waiting to join the tasks in the queue.
>> >> >
>> >> > File: "/usr/local/lib/python3.7/site-packages/airflow/executors/local_executor.py", line 212, in end
>> >> >     self.queue.join()
>> >> >
>> >> > A detailed thread dump for all processes running in the scheduler can
>> >> > be found here: https://pastebin.com/i6ChafWH
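>> >> >
>> >> > (Not Airflow code, just a minimal standard-library sketch of the join()
>> >> > semantics that seem to be involved here: Queue.join() only returns once
>> >> > task_done() has been called for every item that was put(), so a worker
>> >> > that dies or never acknowledges its item leaves join() blocked forever.)
>> >> >
>> >> > import queue
>> >> > import threading
>> >> >
>> >> > q = queue.Queue()
>> >> >
>> >> > def worker(acknowledge):
>> >> >     item = q.get()
>> >> >     print("processing", item)
>> >> >     if acknowledge:
>> >> >         q.task_done()   # signals that this item is finished
>> >> >
>> >> > # Case 1: the worker acknowledges its item, so join() returns.
>> >> > q.put("task-1")
>> >> > threading.Thread(target=worker, args=(True,)).start()
>> >> > q.join()
>> >> > print("join() returned")
>> >> >
>> >> > # Case 2: the worker never calls task_done() (think: a crashed worker
>> >> > # process), so join() blocks forever -- analogous to the stuck shutdown.
>> >> > q.put("task-2")
>> >> > threading.Thread(target=worker, args=(False,)).start()
>> >> > q.join()
>> >> > print("never reached")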
>> >> >
>> >> > Please help in checking why the scheduler is getting stuck waiting for
>> >> > the queue to join.
>> >> >
>> >> > Thanks and Regards,
>> >> > Maulik
>> >> >
>> >>
>> >
>>
>
