Nice post, Abhishek! Glad our discussion was helpful for you guys.

To share more context with the community: Airbnb has also had the problem
of tasks stuck in the QUEUED state. Our issues were more on the executor
side. Originally the cause was a lost-message issue in an early version of
Celery, for which Alex Guziel added internal logic to resend tasks if they
were not picked up by a worker after 60s. More recently the cause was a
message delivery delay on our Redis broker hosted on AWS ElastiCache, which
we solved by moving to SQS (yeah, a sudden, strange headache; attaching two
graphs to show the diff).
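
For anyone curious what a resend-after-60s guard can look like, here is a
minimal sketch. It is not the actual Airbnb patch: ResendTracker,
PendingTask, and the send/clock callables are hypothetical names made up
for illustration, and only the 60s threshold comes from the description
above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

RESEND_AFTER_SECONDS = 60.0  # threshold mentioned above


@dataclass
class PendingTask:
    key: str
    sent_at: float
    attempts: int = 1


@dataclass
class ResendTracker:
    """Hypothetical helper: remembers sent tasks until a worker picks them up."""
    send: Callable[[str], None]                 # assumed: publishes a task to the broker
    clock: Callable[[], float] = time.monotonic
    pending: Dict[str, PendingTask] = field(default_factory=dict)

    def submit(self, key: str) -> None:
        # Send the task and remember when we sent it.
        self.send(key)
        self.pending[key] = PendingTask(key, self.clock())

    def mark_picked_up(self, key: str) -> None:
        # A worker acknowledged the task, so stop tracking it.
        self.pending.pop(key, None)

    def resend_stale(self) -> List[str]:
        # Resend every task that no worker picked up within the threshold.
        now = self.clock()
        resent = []
        for task in self.pending.values():
            if now - task.sent_at >= RESEND_AFTER_SECONDS:
                self.send(task.key)
                task.sent_at = now
                task.attempts += 1
                resent.append(task.key)
        return resent
```

A real version would call resend_stale() from the executor's periodic sync
loop and would need tasks to be safe to deliver twice, since a delayed
original message can still arrive after the resend.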

For the issue you guys have, which can be fixed by restarting the
scheduler, I took a quick look at the scheduler code, and my wild guess at
the root cause is this piece of logic:
<https://github.com/apache/airflow/blob/e5726c761d08bfddb6bb8acf3ecc381220eea140/airflow/jobs/scheduler_job.py#L962-L967>
From my understanding, the biggest effect of a scheduler restart is that
the executor state gets flushed, which is consistent with the behavior you
are seeing.
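
To make that guess concrete, here is a toy model, not Airflow's code. It
assumes (unverified) that the linked lines skip any task the executor's
in-memory state already tracks; ToyExecutor and schedule are hypothetical
names. If a message is lost after the executor records the task, the
scheduler keeps skipping it until that state is thrown away.

```python
from typing import Iterable, List, Set


class ToyExecutor:
    """Hypothetical stand-in for an executor with in-memory bookkeeping."""

    def __init__(self) -> None:
        self.queued: Set[str] = set()
        self.running: Set[str] = set()

    def has_task(self, key: str) -> bool:
        return key in self.queued or key in self.running

    def queue(self, key: str) -> None:
        self.queued.add(key)


def schedule(executor: ToyExecutor, keys: Iterable[str]) -> List[str]:
    # Assumed behavior: skip tasks the executor believes it already has.
    sent = []
    for key in keys:
        if executor.has_task(key):
            continue
        executor.queue(key)
        sent.append(key)
    return sent


executor = ToyExecutor()
first_pass = schedule(executor, ["t1"])   # task is handed to the executor
second_pass = schedule(executor, ["t1"])  # skipped: executor still tracks it
executor = ToyExecutor()                  # "restart": in-memory state is gone
after_restart = schedule(executor, ["t1"])
```

In this model the task is only sent again after the executor object is
recreated, which is what a scheduler restart would do.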

We also have similar scheduler health checks and restart logic; maybe my
peers can add more details later ;)
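
A generic sketch of a heartbeat-based check with a restart hook follows.
It is not Airbnb's or Robinhood's actual logic: the 300s threshold, the
function names, and the get_last_heartbeat/restart callables are all
assumptions for illustration.

```python
import time
from typing import Callable

MAX_HEARTBEAT_AGE_SECONDS = 300.0  # assumed threshold, tune for your setup


def scheduler_is_healthy(last_heartbeat: float, now: float,
                         max_age: float = MAX_HEARTBEAT_AGE_SECONDS) -> bool:
    # Healthy means the scheduler reported a heartbeat recently enough.
    return (now - last_heartbeat) <= max_age


def check_and_restart(get_last_heartbeat: Callable[[], float],
                      restart: Callable[[], None],
                      clock: Callable[[], float] = time.time,
                      max_age: float = MAX_HEARTBEAT_AGE_SECONDS) -> bool:
    # Returns True if a restart was triggered, False otherwise.
    if scheduler_is_healthy(get_last_heartbeat(), clock(), max_age):
        return False
    restart()
    return True
```

Here get_last_heartbeat would read the scheduler's latest heartbeat
timestamp from wherever you record it, and restart would call your
process supervisor.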

And thanks Max for sharing👍


Cheers,
Kevin Y


On Fri, Aug 9, 2019 at 2:17 PM Tao Feng <fengta...@gmail.com> wrote:

> +1 Max, thanks for sharing!
>
> On Fri, Aug 9, 2019 at 2:05 PM Jarek Potiuk <jarek.pot...@polidea.com>
> wrote:
>
> > +1
> >
> > On Fri, Aug 9, 2019 at 10:54 PM Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> >> Thanks to Abhishek Ray @ Robinhood for this great post. I felt like I had
> >> to share it here
> >>
> >> https://robinhood.engineering/upgrading-scaling-airflow-at-robinhood-5b625dfaa2ee
> >>
> >> Max
> >>
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
> >
> >
>
