oli2tup opened a new issue, #47023: URL: https://github.com/apache/airflow/issues/47023
### Apache Airflow Provider(s)

microsoft-azure

### Versions of Apache Airflow Providers

8.3.0

### Apache Airflow version

2.7.3

### Operating System

Ubuntu 20.04

### Deployment

Other

### Deployment details

Airflow running on a VM hosted in Azure

### What happened

We are experiencing an issue with Azure Spot Containers where their status continuously cycles between Unhealthy → Repairing → Running, without actually executing any tasks.

- When they return to the Running state, they remain idle and do not perform any actions.
- Eventually, they go back to Unhealthy, repeating the cycle indefinitely.
- Since they don’t stay in any state for long, they can bypass both container and Airflow timeouts.
- Attempting to manually SSH into a container that reaches the Running state after being Unhealthy fails. In our experience, nothing can be done with the container other than terminating it.
- It seems to occur about 10% of the time with Spot containers in EU-West.

### What you think should happen instead

Ideally, the container should be forcefully terminated when it enters the Unhealthy state to prevent this looping behaviour.

### How to reproduce

Since this is a randomly occurring issue, there is no single snippet of code that can consistently reproduce it. However, the following steps can increase the likelihood of encountering the problem:

- Deploy multiple Azure Spot Containers running Airflow tasks.
- Run tasks during peak hours (e.g., in the EU West region) to increase the chances.
- Monitor container lifecycle events to check if they enter an Unhealthy → Repairing → Running loop.
- (Optional) Manually find a way to spoof the container's status as "Unhealthy."
- Try to SSH into a container that enters the "Running" state after being Unhealthy; it should fail.

It is difficult to force it to happen on demand.

### Anything else

_No response_

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

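
The behaviour requested under "What you think should happen instead" could be sketched roughly as follows. This is only an illustration of the decision logic, not the provider's code: the `FlapDetector` class, its `max_unhealthy` threshold, and the idea of feeding it one polled state at a time are all assumptions, and the state names are simply the ones described in this issue. The actual termination call is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List

# State name as described in this issue (assumed to match what polling returns).
UNHEALTHY = "Unhealthy"


@dataclass
class FlapDetector:
    """Hypothetical helper: decide when a container should be terminated."""

    # Assumption: terminate on the first transition into Unhealthy by default.
    max_unhealthy: int = 1
    history: List[str] = field(default_factory=list)
    unhealthy_count: int = 0

    def observe(self, state: str) -> bool:
        """Record one polled state; return True if the container should be terminated."""
        # Count transitions into Unhealthy, not repeated polls of the same state.
        if state == UNHEALTHY and (not self.history or self.history[-1] != UNHEALTHY):
            self.unhealthy_count += 1
        self.history.append(state)
        # The decision is sticky: a later "Running" does not reset the counter,
        # so a container that keeps cycling cannot escape by briefly recovering.
        return self.unhealthy_count >= self.max_unhealthy


# Example: tolerate one Unhealthy episode, terminate on the second.
detector = FlapDetector(max_unhealthy=2)
polled = ["Running", "Unhealthy", "Repairing", "Running", "Unhealthy"]
decisions = [detector.observe(s) for s in polled]
```

Because the counter is never reset, the decision does not depend on how long the container stays in any single state, which is the reason the existing timeouts are bypassed.
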
### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
