Hi,

Sorry for not providing enough info. It seems that after a series of steps the issue resolved itself; here is what we did:

1) For one of the DAGs, some tasks were "stuck": the scheduler kept sending them to the queue even though they already had "success" status. They would not run (because the status was success), but the re-queueing went on indefinitely. Purging the corresponding RabbitMQ queue got it working.

2) The next day we noticed that 1.8.0 changed the way dags_folder is passed when the scheduler generates the command for execution. For some reason the dags_folder location differs between some of our machines. For now I added a symlink to make sure the DAGs are discovered correctly on all machines.

3) After this the other DAGs seemed to get unstuck too, but the next day we found that tasks in subdags were not getting scheduled even though the logs said all their dependencies were met. After some searching we applied https://github.com/apache/incubator-airflow/commit/5800f565628d11d8ea504468bcc14c4d1c0da10c and now everything seems to be back to normal.
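The fixes in steps 1) and 2) above can be sketched roughly as below. The queue name and paths are made-up examples; the destructive commands are left as comments, and the symlink step is demonstrated with throwaway directories:

```shell
# A minimal sketch (queue name "default" and all paths are assumptions).
# 1) Purge the stuck Celery queue so already-succeeded tasks stop looping
#    (rabbitmqadmin ships with the rabbitmq_management plugin):
#      rabbitmqadmin purge queue name=default
# 2) Symlink the path the scheduler bakes into commands to the worker's
#    actual DAGs directory, so dags_folder resolves on every machine.
#    Demonstrated here with temporary directories:
real_dags=$(mktemp -d)            # stand-in for the worker's real DAGs dir
expected_path=$(mktemp -d)/dags   # stand-in for the scheduler's dags_folder path
ln -sfn "$real_dags" "$expected_path"
readlink "$expected_path"         # confirms the link points at the real dir
```

Note that the symlink only papers over the divergence; keeping dags_folder identical in airflow.cfg on all machines is the cleaner fix.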
Unfortunately I can't really give a proper postmortem because I tried so many things in between and I'm not 100% sure which one was key. But thanks a lot for the ideas on debugging; I will use them next time I run into issues!

-Dima

On 2017-05-03 19:25 (+0300), Maxime Beauchemin <[email protected]> wrote:
> One way to debug these "false starts" (tasks that don't even get to the
> point where logging is initiated) is to:
>
> 1. look at the scheduler log to get the exact command that is put in the
> queue for remote execution
> 2. copy the exact command
> 3. go on the worker and try to recreate the exact context in which the
> worker operates (unix user, env vars, shell type, ...)
> 4. run the command; hopefully you have recreated the false start at this
> point (the task does not run)
> 5. view the pre-logs (stdout), and debug from this context
>
> A common scenario where this happens: the DAG module imports some
> library that exists on the scheduler but not in the worker's context, so
> the task cannot even be initiated.
>
> One way to help prevent this is to be very careful that the run context
> for all your processes on the cluster is identical. You do not want to get
> into a place where the python environment is diverging on different boxes,
> unless you're using queues and are doing it by choice.
>
> Max
>
> On Wed, May 3, 2017 at 5:39 AM, Bolke de Bruin <[email protected]> wrote:
> >
> > Hi Dmitry,
> >
> > Please provide more information, such as logs and the DAG definition
> > itself. This is very little to go on, unfortunately.
> >
> > Bolke
> >
> > > On 3 May 2017, at 10:22, Dmitry Smirnov <[email protected]> wrote:
> > >
> > > Hi everyone,
> > >
> > > I'm using Airflow version 1.8.0, just upgraded from 1.7.1.3.
> > > The issue that I'm going to describe started already in 1.7.1.3; I
> > > upgraded hoping it might help resolve it.
> > >
> > > I have several DAGs for which the *last* task does not move from
> > > queued to running.
> > > These DAGs used to run fine some time ago, but then we had issues
> > > with the rabbitmq cluster we use, and after setting it up again the
> > > problem emerged.
> > > I'm pretty sure the queue is working fine, since all the tasks except
> > > the very last one are queued automatically and run fine.
> > > For the sake of testing, I added a copy of the last task to the DAG;
> > > interestingly, the task that used to be the last and did not run now
> > > started to run normally, but the new last task is stuck.
> > > I checked the logs at DEBUG level and could see that the scheduler
> > > queues the tasks, but those tasks don't show up in the corresponding
> > > queue in the Celery/Flower dashboard.
> > > When I run the stuck task from the webserver interface, it shows up
> > > in the queue in the Flower dashboard and runs successfully.
> > > So, overall, it seems that the issue is with the scheduler but not
> > > with the webserver, and that it only affects the very last task in
> > > the DAG.
> > > I'm really stuck now; I would welcome any suggestions / ideas on
> > > what can be done.
> > >
> > > Thank you in advance!
> > > BR, Dima
> > >
> > > --
> > >
> > > Dmitry Smirnov (MSc.)
> > > Data Engineer @ Yousician
> > > mobile: +358 50 3015072
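For the archive: Maxime's 5-step debugging recipe quoted above can be sketched roughly as follows. The log line format, DAG/task names, and unix user are made-up examples (a real scheduler log lives under $AIRFLOW_HOME/logs), and the commands that would run on a real worker are left as comments:

```shell
# Sketch of the "false start" debugging recipe; everything here is an example.
# A sample scheduler log line standing in for the real log:
LOG=$(mktemp)
echo 'Sending ["airflow run example_dag final_task 2017-05-03T00:00:00 --local"] to executor' > "$LOG"
# Step 1-2: extract the exact command the scheduler handed to the queue:
CMD=$(grep -o 'airflow run [^"]*' "$LOG" | tail -n 1)
echo "$CMD"
# Steps 3-4: on the worker, recreate the worker's context and run it by hand:
#   sudo -iu airflow          # same unix user, login shell, env vars
#   $CMD
# Step 5: stdout/stderr from that run is the "pre-log"; an ImportError here
# means the DAG module needs a library present on the scheduler but not on
# the worker, so the task dies before task logging ever starts.
```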
