Hi everyone,
We are currently experimenting with long-running sensor tasks in reschedule
mode. Some of these sensors run for more than 10 hours, rescheduling every
5 minutes, and I'm seeing many of these tasks fail without any task log being
stored.
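To put some numbers on it (a rough back-of-the-envelope sketch, nothing Airflow-specific): each reschedule is a fresh task instance that goes through queued -> running -> up_for_reschedule again, so a single 10-hour sensor at a 5-minute interval produces on the order of 120 short-lived tasks, each of which has to make the broker/executor round trip:

```python
# Rough estimate: a reschedule-mode sensor does not hold a worker slot;
# every poke is a brand-new task instance. Numbers below are the worst
# case from our setup (10 h runtime, 5-minute reschedule interval).
SENSOR_RUNTIME_H = 10
POKE_INTERVAL_S = 5 * 60

# Each cycle is a separate queued -> running transition the executor
# must track and report back correctly.
cycles = (SENSOR_RUNTIME_H * 3600) // POKE_INTERVAL_S
print(cycles)  # -> 120
```

So even a rare per-task failure mode gets plenty of chances to trigger over one sensor's lifetime.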
Looking into the scheduler logs, I see a lot of messages like this (this
instance failed after 11 minutes):
> Executor reports task instance <TaskInstance: ***.*** 2019-12-14
> 00:00:00+00:00 [queued]> finished (success) although the task says its
> queued. Was the task killed externally?
We are using Airflow with the Celery executor, and Redis as the broker and
result backend (I hope I got the terminology right here). Some googling
suggests that we should not use Redis as the result_backend, but rather a
database.
I’m happy to make this change, but I’d really like to understand better how
that would cause such errors. Can someone explain a bit more what the
result_backend really does, and why using Redis here might be causing problems?
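For context, the change I'd be making is roughly the following (the connection string is a placeholder, not our real setup):

```ini
# airflow.cfg -- sketch of the proposed change; host/credentials are
# placeholders for our actual metadata database.
[celery]
result_backend = db+postgresql://airflow:***@postgres-host/airflow
```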
The documentation also advises using a visibility_timeout longer than the
longest-running task with Celery. I'm wondering whether this also applies to
rescheduling sensors? I also have trouble understanding what this setting
actually does. Can someone explain?
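If I understand the docs correctly, this is where the setting would go (a sketch; I'm not certain this is the right section for our Airflow version, and the value of 6 hours is just an example):

```ini
# airflow.cfg -- sketch; older versions may read this from a different
# section, so please correct me if this is wrong.
[celery_broker_transport_options]
# Seconds Redis waits before re-delivering a task message that was
# delivered to a worker but never acknowledged. Example value: 6 hours.
visibility_timeout = 21600
```

My (possibly wrong) assumption is that each reschedule is its own short-lived task message, so the timeout would not need to cover the full 10-hour sensor runtime, but I'd appreciate confirmation.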
Are there any other configuration or setup issues that might be causing such
behaviour?
Thanks for your help,
Björn