KarthikeyanDevendhiran opened a new issue, #51301:
URL: https://github.com/apache/airflow/issues/51301

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### If "Other Airflow 2 version" selected, which one?
   
   2.10.5
   
   ### What happened?
   
   After upgrading to Airflow 2.10.5, we received the following deprecation 
warning:
   
   
   `/home/airflow/.local/lib/python3.10/site-packages/airflow/cli/commands/scheduler_command.py:41 DeprecationWarning: The '[kubernetes_executor] worker_pods_pending_timeout' config option is deprecated. Please update your config to use '[scheduler] task_queued_timeout' instead.`
   
   We updated our configuration by setting `task_queued_timeout = 300` under the `[scheduler]` section. A few days later, we noticed that several tasks in our production DAGs were failing after remaining in the queued state for 5 minutes. However, no failure alerts were triggered, even though the DAGs are configured to retry 3 times and send alerts via Slack.
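
   For reference, the change we made is equivalent to the following airflow.cfg excerpt:

```ini
# airflow.cfg excerpt: replaces the deprecated
# [kubernetes_executor] worker_pods_pending_timeout option
[scheduler]
task_queued_timeout = 300
```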
   
   We initially assumed that because the tasks never started executing, the alerting mechanism was never invoked. However, our understanding is that `task_queued_timeout` should mark the task as failed and trigger failure alerts like any other failure.
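
   For context, our alerting is wired roughly like the sketch below (the DAG, task, and webhook helper names are illustrative, not our actual code); the point is that `retries` and `on_failure_callback` are set through `default_args`, so a task failed by `task_queued_timeout` should, in our understanding, still invoke the callback:

```python
# Sketch (illustrative names, not our production code) of how our DAGs wire
# retries and Slack alerts. Our expectation is that a task failed by
# [scheduler] task_queued_timeout still invokes on_failure_callback.
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX"  # placeholder


def send_slack_alert(context):
    """Failure callback: post the failed task to a Slack incoming webhook."""
    ti = context["task_instance"]
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Task {ti.dag_id}.{ti.task_id} failed (state: {ti.state})"},
        timeout=10,
    )


default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": send_slack_alert,
}

with DAG(
    dag_id="example_alerting_dag",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    default_args=default_args,
):
    BashOperator(task_id="do_work", bash_command="echo hello")
```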
   
   To mitigate the issue, we increased the timeout to 900 seconds (15 minutes). 
This worked for about three weeks, but we are now seeing the same issue 
again—tasks are failing after being queued for over 15 minutes, without any 
alerts. This is happening despite having 100-node capacity available in our 
Databricks instance pool.
   
   Either `task_queued_timeout` failures are not integrated with the alerting mechanism, or the tasks are not being scheduled properly despite available resources, possibly due to a scheduling or executor issue.
   
   ### What you think should happen instead?
   
   - Tasks that fail due to `task_queued_timeout` should trigger failure alerts, just like any other task failure.
   - Tasks should not remain in the queued state for extended periods if sufficient executor capacity is available.
   
   ### How to reproduce
   
   - Upgrade to Airflow 2.10.5.
   - Set `task_queued_timeout = 300` in `airflow.cfg` under `[scheduler]`.
   - Deploy a DAG with tasks that may experience scheduling delays (see the sketch after this list).
   - Observe tasks failing after 5 minutes in the queued state without 
triggering alerts.
   - Increase timeout to 900 seconds and observe recurrence under similar 
conditions.
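
   One concrete way to provoke the queued-state timeout (a sketch assuming the KubernetesExecutor, which is what the original deprecation warning refers to; this is not our production DAG) is a task whose `executor_config` pod override requests resources no node can satisfy, so the worker pod stays Pending and the task sits in queued until `task_queued_timeout` fails it:

```python
# Sketch of a "stuck in queued" task, assuming the KubernetesExecutor:
# the pod_override requests resources no node can satisfy, so the worker pod
# stays Pending and the task remains queued until task_queued_timeout fails it.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from kubernetes.client import models as k8s

unschedulable = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # must match the task container name ("base" by default)
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "1000", "memory": "1Ti"}
                    ),
                )
            ]
        )
    )
}

with DAG(
    dag_id="queued_timeout_repro",
    start_date=datetime(2025, 1, 1),
    schedule=None,
):
    BashOperator(
        task_id="stuck_in_queue",
        bash_command="echo never runs",
        executor_config=unschedulable,
    )
```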
   
   ### Operating System
   
   linux/amd64
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   
![Image](https://github.com/user-attachments/assets/b3720ba6-0f16-4885-a08e-d36a15a645ce)
   
   We maintain our Helm charts in a Git repository and deploy them to Kubernetes using ArgoCD.
   
   ### Anything else?
   
   - This does not appear to be an error in our callback code, as the same callbacks work for other tasks/DAGs.
   - Logs for the failed task show:
   `*** Could not read served logs: Invalid URL 'http://:8793/log/dag_id=...': No host supplied`
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

