Hello,

You can see the zombie tasks being killed in the scheduler logs (text file); I think you need the 'INFO' log level, though. The detection is based on the time difference from the last heartbeat in the job table.
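Roughly, the check works like the Python sketch below. This is a minimal illustration rather than Airflow's actual code; the config key (scheduler_zombie_task_threshold under [scheduler], 300 seconds by default) and the Postgres interval syntax are assumptions on my side, so double-check them against your version.

from datetime import datetime, timedelta

# Minimal sketch of the scheduler's zombie check, not Airflow's real code.
# Assumed threshold: [scheduler] scheduler_zombie_task_threshold, 300s default.
ZOMBIE_THRESHOLD = timedelta(seconds=300)

def looks_like_zombie(latest_heartbeat: datetime) -> bool:
    """A 'running' job whose last heartbeat is older than the threshold
    is treated as a zombie and its task instance is set to failed."""
    return datetime.utcnow() - latest_heartbeat > ZOMBIE_THRESHOLD

# To spot candidates yourself, the heartbeats live in the job table of the
# metadata database (interval syntax below assumes Postgres):
STALE_JOBS_SQL = """
SELECT id, dag_id, state, latest_heartbeat
FROM job
WHERE state = 'running'
  AND latest_heartbeat < NOW() - INTERVAL '300 seconds';
"""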
Best,
E

On Fri, Aug 24, 2018 at 4:36 PM Dubey, Somesh <somesh.du...@nordstrom.com> wrote:

> Thanks so much, Emmanuel, for the reply.
> How did Airflow classify those processes as zombies? Did any log say
> that? We have not been able to get any good logs of the failing tasks.
>
> Thanks so much; you are potentially saving hours of work and sleepless
> nights, as things are breaking production for us.
> Somesh
>
> -----Original Message-----
> From: Emmanuel Brard [mailto:emmanuel.br...@getyourguide.com]
> Sent: Friday, August 24, 2018 6:48 AM
> To: dev@airflow.incubator.apache.org
> Subject: Re: airflow 1.9 tasks randomly failing | k8 - hive
>
> Hey,
>
> We have a similar setup: Airflow 1.9 on Kubernetes with the Celery
> executor.
>
> We saw some Airflow tasks being killed inside the container because of
> cgroup limits (set by Kubernetes and passed on to the Docker daemon),
> but not Airflow itself (the airflow celery command), so they ended up as
> zombies identified by the scheduler, which set the tasks to failed. We
> decreased the number of Celery slots on each worker, reducing memory
> pressure, and this stopped. It happened mostly with one DAG that had
> simple operators but was quite big, with subdags involved.
>
> Maybe you are facing the same issue?
>
> Bye,
> E
>
> On Fri, Aug 24, 2018 at 3:40 PM Dubey, Somesh <somesh.du...@nordstrom.com> wrote:
>
> > We are on Airflow 1.9 on Kubernetes. We have many Hive DAGs which we
> > call via JDBC (via a REST endpoint), and tasks fail randomly: one day
> > one fails, the next day another.
> > The pods are all healthy, there are no node evictions or other
> > issues, and retries typically work. Even when a task fails, it
> > completes successfully on the Hive side.
> > We do not get good logs of the failure; we typically get partial SQL
> > in stdout in the task log, with no error message. The log level is
> > set to 5 on the JDBC connection.
> > This only happens in production; we have not been able to reproduce
> > the issue in dev.
> >
> > The closest we have gotten in dev to replicating similar behavior is
> > to kill the pid (manually killing airflow run --raw) on the worker
> > node. In that case things also run fine on the Hive side, but the
> > task fails in Airflow, and we get a similar task log with no error.
> >
> > Have you seen this kind of behavior? Any help is much appreciated.
> >
> > Thanks so much,
> > Somesh
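PS: on "decreasing the number of celery slots" above, what that boils down to is sizing worker concurrency against the pod's memory limit, so the concurrent tasks cannot blow the cgroup. A rough back-of-the-envelope in Python; all numbers are illustrative, and the 1.9-era setting name (celeryd_concurrency under [celery]) is from memory, so verify it in your airflow.cfg:

# Back-of-the-envelope sizing of Celery slots against the k8s memory limit.
# All numbers are illustrative assumptions, not measured values.
POD_MEMORY_LIMIT_MB = 4096   # resources.limits.memory on the worker pod
PER_TASK_PEAK_MB = 700       # assumed peak RSS of one `airflow run --raw`
WORKER_OVERHEAD_MB = 500     # assumed overhead of the celery worker itself

slots = max(1, (POD_MEMORY_LIMIT_MB - WORKER_OVERHEAD_MB) // PER_TASK_PEAK_MB)
print("suggested concurrency:", slots)  # prints 5 with the numbers above
# then set, e.g., celeryd_concurrency = 5 in airflow.cfg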