KarthikeyanDevendhiran opened a new issue, #51301:
URL: https://github.com/apache/airflow/issues/51301

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### If "Other Airflow 2 version" selected, which one?
   
   2.10.5
   
   ### What happened?
   
   After upgrading to Airflow 2.10.5, we received the following deprecation 
warning:
   
   
   `/home/airflow/.local/lib/python3.10/site-packages/airflow/cli/commands/scheduler_command.py:41 DeprecationWarning: The '[kubernetes_executor] worker_pods_pending_timeout' config option is deprecated. Please update your config to use '[scheduler] task_queued_timeout' instead.`
   
   We updated our configuration by setting `task_queued_timeout = 300` under the `[scheduler]` section. A few days later, we noticed that several tasks in our production DAGs were failing after remaining in the queued state for 5 minutes. However, no failure alerts were triggered, even though the DAGs are configured to retry 3 times and send alerts via Slack.
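
   For reference, the change we made is equivalent to the following airflow.cfg excerpt:

```ini
# airflow.cfg excerpt: replaces the deprecated
# [kubernetes_executor] worker_pods_pending_timeout option
[scheduler]
task_queued_timeout = 300
```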
   
   We initially assumed that because the tasks never started executing, the alerting mechanism was never invoked. However, our understanding is that `task_queued_timeout` should mark the task as failed and trigger failure alerts like any other failure.
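
   For context, our alerting is wired roughly like the sketch below (the DAG, task, and webhook helper names are illustrative, not our actual code); the point is that `retries` and `on_failure_callback` are set through `default_args`, so a task failed by `task_queued_timeout` should, in our understanding, still invoke the callback:

```python
# Sketch (illustrative names, not our production code) of how our DAGs wire
# retries and Slack alerts. Our expectation is that a task failed by
# [scheduler] task_queued_timeout still invokes on_failure_callback.
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX"  # placeholder


def send_slack_alert(context):
    """Failure callback: post the failed task to a Slack incoming webhook."""
    ti = context["task_instance"]
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Task {ti.dag_id}.{ti.task_id} failed (state: {ti.state})"},
        timeout=10,
    )


default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": send_slack_alert,
}

with DAG(
    dag_id="example_alerting_dag",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    default_args=default_args,
):
    BashOperator(task_id="do_work", bash_command="echo hello")
```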
   
   To mitigate the issue, we increased the timeout to 900 seconds (15 minutes). 
This worked for about three weeks, but we are now seeing the same issue 
again—tasks are failing after being queued for over 15 minutes, without any 
alerts. This is happening despite having 100-node capacity available in our 
Databricks instance pool.
   
   Either `task_queued_timeout` failures are not integrated with the alerting mechanism, or the tasks are not being scheduled properly despite available resources, possibly due to a scheduling or executor issue.
   
   ### What you think should happen instead?
   
   - Tasks that fail due to `task_queued_timeout` should trigger failure alerts, just like any other task failure.
   - Tasks should not remain in the queued state for extended periods if sufficient executor capacity is available.
   
   ### How to reproduce
   
   - Upgrade to Airflow 2.10.5.
   - Set `task_queued_timeout = 300` in `airflow.cfg` under `[scheduler]`.
   - Deploy a DAG with tasks that may experience scheduling delays (see the sketch after this list).
   - Observe tasks failing after 5 minutes in the queued state without 
triggering alerts.
   - Increase timeout to 900 seconds and observe recurrence under similar 
conditions.
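
   One concrete way to provoke the queued-state timeout (a sketch assuming the KubernetesExecutor, which is what the original deprecation warning refers to; this is not our production DAG) is a task whose `executor_config` pod override requests resources no node can satisfy, so the worker pod stays Pending and the task sits in queued until `task_queued_timeout` fails it:

```python
# Sketch of a "stuck in queued" task, assuming the KubernetesExecutor:
# the pod_override requests resources no node can satisfy, so the worker pod
# stays Pending and the task remains queued until task_queued_timeout fails it.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from kubernetes.client import models as k8s

unschedulable = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # must match the task container name ("base" by default)
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "1000", "memory": "1Ti"}
                    ),
                )
            ]
        )
    )
}

with DAG(
    dag_id="queued_timeout_repro",
    start_date=datetime(2025, 1, 1),
    schedule=None,
):
    BashOperator(
        task_id="stuck_in_queue",
        bash_command="echo never runs",
        executor_config=unschedulable,
    )
```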
   
   ### Operating System
   
   linux/amd64
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   
![Image](https://github.com/user-attachments/assets/b3720ba6-0f16-4885-a08e-d36a15a645ce)
   
   We maintain our Helm charts in a Git repository and deploy them to Kubernetes using ArgoCD.
   
   ### Anything else?
   
   - This does not appear to be an error in our callback code, as the same callbacks work for other tasks/DAGs.
   - Logs for the failed task show:
   `*** Could not read served logs: Invalid URL 'http://:8793/log/dag_id=...': No host supplied`
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

