Two questions: 1) Are you eventually seeing the full log for the task after it finishes? 2) Are you using S3 to store your logs?
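
If the answer to 2) is yes, it may be worth double-checking the remote logging settings before digging further. A minimal sketch, assuming an Airflow 1.10-style config; the bucket and connection id below are placeholders:

    # airflow.cfg -- remote task logs in S3 (bucket and conn id are placeholders)
    [core]
    remote_logging = True
    remote_base_log_folder = s3://my-airflow-logs/logs
    remote_log_conn_id = my_s3_conn

If remote logging is misconfigured, the webserver may never be able to fetch the finished task's log, which can look similar to the silent failures described below.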
On Thu, Feb 14, 2019 at 11:53 AM Dan Stoner <[email protected]> wrote:
> More info!
>
> It appears that the Celery executor will silently fail if the
> credentials to a postgres results_backend are not valid.
>
> For example, we see:
>
> [2019-02-13 20:45:21,132] {{models.py:1353}} INFO - Dependencies not
> met for <TaskInstance: update_table_progress.update_table
> 2019-02-13T20:30:00+00:00 [running]>, dependency 'Task Instance Not
> Already Running' FAILED: Task is already running, it started on
> 2019-02-13 20:45:09.088978+00:00.
> [2019-02-13 20:45:21,132] {{models.py:1353}} INFO - Dependencies not
> met for <TaskInstance: update_table_progress.update_table
> 2019-02-13T20:30:00+00:00 [running]>, dependency 'Task Instance State'
> FAILED: Task is in the 'running' state which is not a valid state for
> execution. The task must be cleared in order to be run.
> [2019-02-13 20:45:21,135] {{logging_mixin.py:95}} INFO - [2019-02-13
> 20:45:21,134] {{jobs.py:2514}} INFO - Task is not able to be run
>
> but no database connection failure anywhere in the logs.
>
> After fixing our connection string (via
> AIRFLOW__CELERY__RESULT_BACKEND or result_backend in airflow.cfg),
> these issues went away.
>
> Sorry I cannot produce a more solid bug report, but hopefully this is a
> breadcrumb for someone.
>
> Dan Stoner
>
> On Wed, Feb 13, 2019 at 10:16 PM Dan Stoner <[email protected]> wrote:
> >
> > We saw this, but the task instance state was generally "SUCCESS".
> >
> > In our case, we thought it was due to Redis being used as the results
> > store. There is a WARNING against this right in the operational logs.
> > Google Cloud Composer is surprisingly set up in this fashion.
> >
> > We went back to running our own infrastructure and using postgres as
> > the results store; those issues have not occurred since.
> >
> > The real downside we saw to this error was that our workers were
> > highly underutilized, we were getting terrible overall data
> > throughput, and the workers kept trying to run these tasks they
> > couldn't actually run.
> >
> > - Dan Stoner
> >
> > On Wed, Feb 13, 2019 at 4:16 PM Kevin Lam <[email protected]> wrote:
> > >
> > > Friendly ping on the above! Has anyone encountered this by chance?
> > >
> > > We're still seeing it occasionally on longer-running tasks.
> > >
> > > On Tue, Nov 20, 2018 at 10:31 AM Kevin Lam <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > We run Apache Airflow in Kubernetes in a manner very similar to what
> > > > is outlined in puckel/docker-airflow [1] (Celery Executor, Redis for
> > > > messaging, Postgres).
> > > >
> > > > Lately, we've encountered some of our tasks getting stuck in a
> > > > running state and printing out the errors:
> > > >
> > > >> [2018-11-20 05:31:23,009] {models.py:1329} INFO - Dependencies not
> > > >> met for <TaskInstance: BLAH 2018-11-19T19:19:50.757184+00:00
> > > >> [running]>, dependency 'Task Instance Not Already Running' FAILED:
> > > >> Task is already running, it started on 2018-11-19 23:29:11.974497+00:00.
> > > >> [2018-11-20 05:31:23,016] {models.py:1329} INFO - Dependencies not
> > > >> met for <TaskInstance: BLAH 2018-11-19T19:19:50.757184+00:00
> > > >> [running]>, dependency 'Task Instance State' FAILED: Task is in the
> > > >> 'running' state which is not a valid state for execution. The task
> > > >> must be cleared in order to be run.
> > > >
> > > > Is there any way to avoid this? Does anyone know what causes this
> > > > issue?
> > > >
> > > > This is quite problematic. The task is stuck in the running state without
> > > > making any progress when the above error occurs, so turning on retries
> > > > doesn't help with getting our DAGs to reliably run to completion.
> > > >
> > > > Thanks!
> > > >
> > > > [1] https://github.com/puckel/docker-airflow
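
For anyone landing on this thread later: the fix Dan describes above amounts to pointing Celery's result backend at a database it can actually reach with valid credentials. A minimal sketch of what that setting can look like; the host, user, password and database name below are placeholders for your own:

    # airflow.cfg
    [celery]
    result_backend = db+postgresql://airflow:airflow@postgres:5432/airflow

    # or, equivalently, as an environment variable
    AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres:5432/airflow

The db+ prefix selects Celery's SQLAlchemy-backed database result store rather than a broker-style backend.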

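Also, while debugging: a task instance that is wedged in the 'running' state can be reset from the CLI. The dag id, task id and execution date below are taken from the log excerpt above; substitute your own:

    airflow clear -t update_table -s 2019-02-13T20:30:00 -e 2019-02-13T20:30:00 update_table_progress

Note that clearing only lets the task be scheduled again; it does not fix an underlying result backend misconfiguration.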