oli2tup opened a new issue, #47023:
URL: https://github.com/apache/airflow/issues/47023

   ### Apache Airflow Provider(s)
   
   microsoft-azure
   
   ### Versions of Apache Airflow Providers
   
   8.3.0
   
   ### Apache Airflow version
   
   2.7.3
   
   ### Operating System
   
   Ubuntu 20.04
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   Airflow running on a VM hosted in Azure
   
   ### What happened
   
   We are experiencing an issue with Azure Spot Containers where their status 
continuously cycles between Unhealthy → Repairing → Running, without actually 
executing any tasks.
   
   - When they return to the Running state, they remain idle and do not perform 
any actions.
   - Eventually, they go back to Unhealthy, repeating the cycle indefinitely.
   - Since they don’t stay in any state for long, they can bypass both 
container and Airflow timeouts.
   - Attempting to manually SSH into a container that reaches the Running state 
after being Unhealthy fails. In our experience, nothing can be done with the 
container other than terminating it.
   - It appears to affect Spot containers in EU West about 10% of the time.
   
   
   ### What you think should happen instead
   
   Ideally, the container should be forcefully terminated when it enters the 
Unhealthy state to prevent this looping behaviour.
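
As a rough sketch of the behaviour we have in mind (not the provider's actual code), the polling loop could terminate the container on the first Unhealthy observation. The callables `get_state`, `terminate` and `is_done` are hypothetical placeholders for whatever the operator uses to query and delete the container, and the state name is taken from what we observe in this issue.

```python
import time
from typing import Callable

UNHEALTHY = "Unhealthy"  # state name as observed in this issue


def wait_or_terminate(
    get_state: Callable[[], str],    # hypothetical: returns the current container state
    terminate: Callable[[], None],   # hypothetical: force-deletes the container
    is_done: Callable[[str], bool],  # hypothetical: True once the state is terminal
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the container and terminate it as soon as it is seen Unhealthy.

    Returns the last observed state.
    """
    while True:
        state = get_state()
        if state == UNHEALTHY:
            # Do not wait for Repairing -> Running; the container never recovers.
            terminate()
            return state
        if is_done(state):
            return state
        sleep(poll_interval)
```

Because the container only stays Unhealthy briefly, a poll interval longer than that window could miss it, so the interval would need to be chosen accordingly.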
   
   ### How to reproduce
   
   Since this is a randomly occurring issue, there is no single snippet of code 
that can consistently reproduce it. However, the following steps can increase 
the likelihood of encountering the problem:
   
   - Deploy multiple Azure Spot Containers running Airflow tasks.
   - Run tasks during peak hours (e.g., in the EU West region) to increase the 
chances of hitting the issue.
   - Monitor container lifecycle events to check if they enter an Unhealthy → 
Repairing → Running loop.
   - (Optional) Manually find a way to spoof the container's status as 
"Unhealthy."
   - Try to SSH into a container that enters the "Running" state after being 
Unhealthy; it should fail.
   
   It is difficult to force it to happen on demand.
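
For the monitoring step, a minimal sketch of how the loop could be recognised from a list of polled states is shown below. It assumes the states have already been collected as plain strings; the function names and the `max_cycles` threshold are hypothetical and only illustrate the idea.

```python
from typing import Iterable, List

# The sequence described in this issue.
CYCLE = ("Unhealthy", "Repairing", "Running")


def count_repair_cycles(states: Iterable[str]) -> int:
    """Count completed Unhealthy -> Repairing -> Running sequences."""
    # Collapse repeated observations of the same state into single transitions.
    transitions: List[str] = []
    for state in states:
        if not transitions or transitions[-1] != state:
            transitions.append(state)

    # Slide a window of three transitions and count exact matches.
    return sum(
        1
        for i in range(len(transitions) - len(CYCLE) + 1)
        if tuple(transitions[i : i + len(CYCLE)]) == CYCLE
    )


def looks_stuck(states: Iterable[str], max_cycles: int = 2) -> bool:
    """Hypothetical rule: treat the container as stuck after max_cycles cycles."""
    return count_repair_cycles(states) >= max_cycles
```

For example, `looks_stuck(observed)` could be checked on each poll, with the container terminated once it returns `True`.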
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.