Nice post, Abhishek! Glad our discussion was helpful for you guys. To share more context with the community: Airbnb has also had the problem of tasks getting stuck in the QUEUED state. Our issues were more on the executor side. Originally the cause was a message-loss issue in early versions of Celery, for which Alex Guziel added internal logic to resend tasks that had not been picked up by a worker after 60s. More recently the cause was a message delivery delay on our Redis broker hosted on AWS ElastiCache, which we solved by moving to SQS (yeah, a sudden and strange headache; attaching two graphs to show the difference).
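
To make the resend idea above concrete, here is a minimal sketch of "resend a task if no worker has picked it up after 60s". This is not the actual internal Airbnb logic, which is not shown in this thread; `ResendTracker`, its methods, and the `send` callback are hypothetical names used for illustration only, and the clock is injectable purely so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Dict, Hashable, List

RESEND_AFTER_SECONDS = 60.0  # the 60s threshold mentioned above


class ResendTracker:
    """Hypothetical helper that remembers when each task was sent to the broker."""

    def __init__(
        self,
        send: Callable[[Hashable], None],
        timeout: float = RESEND_AFTER_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send        # assumed callback that publishes a task to the broker
        self._timeout = timeout
        self._clock = clock
        self._sent_at: Dict[Hashable, float] = {}

    def submit(self, task_key: Hashable) -> None:
        """Publish the task and record (or refresh) the time it was sent."""
        self._send(task_key)
        self._sent_at[task_key] = self._clock()

    def mark_picked_up(self, task_key: Hashable) -> None:
        """Stop tracking a task once a worker has reported picking it up."""
        self._sent_at.pop(task_key, None)

    def resend_stale(self) -> List[Hashable]:
        """Resend every tracked task that has waited at least `timeout` seconds."""
        now = self._clock()
        stale = [key for key, sent in self._sent_at.items() if now - sent >= self._timeout]
        for key in stale:
            self.submit(key)  # resending also resets the task's timer
        return stale
```

A scheduler loop would call `resend_stale()` periodically. Note that resending can deliver the same task twice if the first message was only delayed rather than lost, so a real deployment would also need the worker side to tolerate duplicates.
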
For the issue you guys have, which can be fixed by restarting the scheduler, I took a quick look at the scheduler code, and my wild guess at the root cause is this piece of logic: <https://github.com/apache/airflow/blob/e5726c761d08bfddb6bb8acf3ecc381220eea140/airflow/jobs/scheduler_job.py#L962-L967>. From my understanding, the biggest effect of a scheduler restart is that the executor state gets flushed, which is consistent with the behavior you describe. We also have similar scheduler health checks and restart logic; maybe my peers can add more details later ;) And thanks, Max, for sharing 👍

Cheers,
Kevin Y

On Fri, Aug 9, 2019 at 2:17 PM Tao Feng <fengta...@gmail.com> wrote:

> +1 Max, thanks for sharing!
>
> On Fri, Aug 9, 2019 at 2:05 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
>
> > +1
> >
> > On Fri, Aug 9, 2019 at 10:54 PM Maxime Beauchemin <maximebeauche...@gmail.com> wrote:
> >
> >> Thanks to Abhishek Ray @ Robinhood for this great post. I felt like I had
> >> to share it here
> >>
> >> https://robinhood.engineering/upgrading-scaling-airflow-at-robinhood-5b625dfaa2ee
> >>
> >> Max
> >
> > --
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129