It happens mostly when the scheduler is catching up, more specifically when I load a brand-new DAG with a start date in the past. I usually have it set to run 5 DAG runs concurrently, with up to 16 tasks running at the same time.
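
For reference, the DAG-level settings look roughly like this (a minimal sketch from memory against Airflow 1.8; the DAG name, task, and schedule below are just placeholders, and the real DAG has many more tasks):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2017, 1, 1),  # start date in the past, so the scheduler catches up
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# 5 concurrent DAG runs, up to 16 concurrently running task instances
dag = DAG(
    'example_backfill_dag',      # placeholder name
    default_args=default_args,
    schedule_interval='@daily',
    max_active_runs=5,
    concurrency=16,
)

run_step = BashOperator(
    task_id='run_step',          # placeholder task
    bash_command='echo "doing work"',
    dag=dag,
)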
What I've also noticed is that the tasks sit completed in reality but
uncompleted in the Airflow DB for many hours, yet if I just leave them
sitting there overnight they all tend to be marked complete by the next
morning. Perhaps this points to some sort of Celery timeout or connection
retry interval?

--
Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
New Relic

On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin
<maximebeauche...@gmail.com> wrote:

> By the time "INFO - Task exited with return code 0" gets logged, the task
> should have been marked as successful by the subprocess. I have no
> specific intuition as to what the issue may be.
>
> I'm guessing at that point the job stops emitting heartbeat and
> eventually the scheduler will handle it as a failure?
>
> How often does that happen?
>
> Max
>
> On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <mw...@newrelic.com> wrote:
>
> > From what I can tell, it only affects CeleryExecutor. I've never seen
> > this behavior with LocalExecutor before.
> >
> > Max, do you know anything about this type of failure mode?
> >
> > --
> > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > Engagement | New Relic
> >
> > On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <thejo...@gmail.com>
> > wrote:
> >
> > > We have the exact same problem. In our case, it's a bash operator
> > > starting a docker container. The container and the process it ran
> > > exit, but the 'docker run' command is still showing up in the process
> > > table, waiting for an event.
> > > I'm trying to switch to LocalExecutor to see if that will help.
> > >
> > > _jonas
> > >
> > > On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <mw...@newrelic.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > Has anyone seen the behavior when using CeleryExecutor where
> > > > workers will finish their tasks ("INFO - Task exited with return
> > > > code 0" shows in the logs) but are never marked as complete in the
> > > > Airflow DB or UI? Effectively this causes tasks to hang even though
> > > > they are complete, and the DAG will not continue.
> > > >
> > > > This is happening on 1.8.0. Anyone else seen this or perhaps have a
> > > > workaround?
> > > >
> > > > Thanks!
> > > >
> > > > --
> > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > > > Engagement | New Relic