Nice post Abhishek! Thanks Max for sharing!

At Airbnb, we also have a cron job that restarts the scheduler. The cron job
frequently checks a few critical metrics (canary delay, latest scheduler
heartbeat time, and latest dag_processor_manager_log modification time),
and restarts the scheduler if an anomaly is detected. This cron job ends up
restarting the scheduler 1 or 2 times each day in Airbnb’s main cluster.
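
For reference, a minimal sketch of what such a health-check cron job could
look like is below. The thresholds, log path, psql query, and systemd unit
name are illustrative assumptions, not our actual setup:

#!/usr/bin/env python3
# Minimal sketch of a scheduler health-check cron job. Thresholds, paths,
# the psql invocation, and the systemd unit name are assumptions for
# illustration only.
import os
import subprocess
import time

HEARTBEAT_MAX_AGE_S = 300    # assumed: restart if no heartbeat for 5 minutes
MANAGER_LOG_MAX_AGE_S = 300  # assumed: restart if DAG file processing stalls
MANAGER_LOG = "/var/log/airflow/dag_processor_manager/dag_processor_manager.log"


def scheduler_heartbeat_age_s() -> float:
    """Seconds since the latest running SchedulerJob heartbeat in the metadata DB."""
    out = subprocess.check_output([
        "psql", "airflow", "-tAc",
        "SELECT EXTRACT(EPOCH FROM (now() AT TIME ZONE 'utc') - MAX(latest_heartbeat)) "
        "FROM job WHERE job_type = 'SchedulerJob' AND state = 'running'",
    ]).decode().strip()
    return float(out) if out else float("inf")


def manager_log_age_s() -> float:
    """Seconds since dag_processor_manager.log was last modified."""
    try:
        return time.time() - os.path.getmtime(MANAGER_LOG)
    except OSError:
        return float("inf")


def main() -> None:
    # Canary delay is checked from emitted metrics in a similar way; it is
    # omitted here to keep the sketch short.
    if (scheduler_heartbeat_age_s() > HEARTBEAT_MAX_AGE_S
            or manager_log_age_s() > MANAGER_LOG_MAX_AGE_S):
        subprocess.run(["systemctl", "restart", "airflow-scheduler"], check=True)


if __name__ == "__main__":
    main()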

The cron job was originally added to mitigate random scheduling delays
caused by a large number of executor events. Early this year, we experienced
some random scheduling delays and found that restarting the scheduler
always solved the issue. Looking into the scheduler log, we identified the
root cause as slow executor event processing. Special executor event
handling is required when the Celery task state doesn't match the state in
the metastore DB. Such executor events can appear for various reasons, e.g.
a DAG parse failure on the worker side. Each of these executor events
triggers a DAG parse in the main scheduler loop. Today, these executor
events are handled sequentially, which can take a long time and delay
scheduling. Restarting the scheduler flushes the executor event buffer
and relieves the scheduler of that pain.
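
To make the bottleneck concrete, here is a highly simplified sketch (not the
actual scheduler code) of sequential draining of the event buffer, where
every state mismatch triggers a synchronous DAG parse:

# Highly simplified sketch of the sequential bottleneck described above,
# not the actual scheduler code.
import time
from typing import Callable, Dict, Tuple

TaskKey = Tuple[str, str, str]  # (dag_id, task_id, execution_date)


def process_executor_events(
    executor_events: Dict[TaskKey, str],
    db_state: Dict[TaskKey, str],
    parse_dag_file: Callable[[str], None],
) -> None:
    """Drain the executor event buffer one event at a time."""
    for (dag_id, task_id, execution_date), executor_state in executor_events.items():
        if db_state.get((dag_id, task_id, execution_date)) == executor_state:
            continue  # states agree, nothing special to do
        # State mismatch: the DAG file is parsed again to decide how to handle
        # the task. With N mismatched events this costs N sequential parses,
        # and the rest of the scheduling loop waits.
        parse_dag_file(dag_id)


if __name__ == "__main__":
    # Simulate 20 mismatched events with a 0.5s parse each: ~10s of delay.
    events = {(f"dag_{i}", "task", "2019-08-09"): "failed" for i in range(20)}
    start = time.time()
    process_executor_events(events, db_state={},
                            parse_dag_file=lambda dag_id: time.sleep(0.5))
    print(f"Drained {len(events)} events in {time.time() - start:.1f}s")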

In addition to the cron job, we are also interested in solving this problem
in Airflow’s core logic. We have an open PR
<https://github.com/apache/airflow/pull/5215> that changes the scheduler
logic to handle these events in parallel.
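
For illustration only, one way to spread per-event work like this across a
process pool is sketched below; it shows the general idea and is not
necessarily the approach taken in the PR:

# Illustrative only: parallelize per-event handling with a process pool.
import time
from concurrent.futures import ProcessPoolExecutor


def handle_mismatched_event(dag_id: str) -> str:
    """Stand-in for the per-event work, dominated by re-parsing the DAG file."""
    time.sleep(0.5)  # simulate an expensive DAG parse
    return dag_id


if __name__ == "__main__":
    dag_ids = [f"dag_{i}" for i in range(20)]
    start = time.time()
    with ProcessPoolExecutor(max_workers=8) as pool:
        for dag_id in pool.map(handle_mismatched_event, dag_ids):
            pass  # apply the state-mismatch handling for dag_id here
    print(f"Handled {len(dag_ids)} events in {time.time() - start:.1f}s")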

I hope our experience helps some people in the community gain a deeper
understanding of Airflow's scheduling logic and improve their own user
experience.


Cheers,

Cong Zhu


On 2019/08/09 23:33:18, Kevin Yang <yrql...@gmail.com> wrote:
> Nice post Abhishek! Glad our discussion was helpful for you guys.
>
> To share more context with the community, Airbnb had task stuck in QUEUED
> state problem before too. Our issues were more on the executor side.
> Originally it was because of a message loss issue in an early version of
> Celery, for which Alex Guziel added internal logic to resend tasks if they
> were not picked up by the worker within 60s. More recently it was because
> of a message delivery delay issue on our Redis broker hosted on AWS
> ElastiCache, which we solved by moving to SQS (yeah, a sudden strange
> headache; attaching two graphs to show the diff).
>
> For the issue you guys have, which can be fixed by restarting the
> scheduler, I took a quick look at the scheduler code and my wild guess of
> the root cause is this piece of logic
> <https://github.com/apache/airflow/blob/e5726c761d08bfddb6bb8acf3ecc381220eea140/airflow/jobs/scheduler_job.py#L962-L967>.
> From my understanding, the biggest effect of a scheduler restart is that
> executor state gets flushed, which is aligned with the behavior.
>
> We also have similar scheduler health checks and restart logic, maybe my
> peers can add more details later ;)
>
> And thanks Max for sharing👍
>
>
> Cheers,
> Kevin Y
>
>
> On Fri, Aug 9, 2019 at 2:17 PM Tao Feng <fengta...@gmail.com> wrote:
>
> > +1 Max, thanks for sharing!
> >
> > On Fri, Aug 9, 2019 at 2:05 PM Jarek Potiuk <jarek.pot...@polidea.com>
> > wrote:
> >
> > > +1
> > >
> > > On Fri, Aug 9, 2019 at 10:54 PM Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > >> Thanks to Abhishek Ray @ Robinhood for this great post. I felt like I
> > >> had to share it here
> > >>
> > >> https://robinhood.engineering/upgrading-scaling-airflow-at-robinhood-5b625dfaa2ee
> > >>
> > >> Max
> > >>
> > >
> > >
> > > --
> > >
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >
> > > M: +48 660 796 129 <+48660796129>
> > >
> > >
> >
>
