karenbraganz opened a new issue, #57167: URL: https://github.com/apache/airflow/issues/57167
### Apache Airflow version 3.1.0 ### If "Other Airflow 2/3 version" selected, which one? _No response_ ### What happened? Supervision of a task instance failed after a single failed heartbeat attempt even though the max_failed_heartbeats is set to 3. This happened because an exception was raised when the [_handle_heartbeat_failures function](https://github.com/apache/airflow/blob/54bd5d8cd9f6f477cc83445737614dec81c4323c/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L1116) was called. During the first failed heartbeat attempt, the _handle_heartbeat_failures function logs a message by calling log.warning(), which accepts an exception parameter that expects a string type object. However, in the source code, [an exception type object is passed](https://github.com/apache/airflow/blob/54bd5d8cd9f6f477cc83445737614dec81c4323c/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L1126) instead of a string type object. This results in a TypeError (like below) which causes task supervision to fail. ``` TypeError: can only concatenate str (not "RemoteProtocolError") to str ``` I have attached the stack trace from the worker logs. [worker_typeerror.rtf](https://github.com/user-attachments/files/23104884/worker_typeerror.rtf) ### What you think should happen instead? I believe a string should be passed (like below) instead of an exception object [here](https://github.com/apache/airflow/blob/54bd5d8cd9f6f477cc83445737614dec81c4323c/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L1126). ``` exception=str(exc) ``` Alternatively, I think we could pass `exc_info=True` instead of the exception parameter. This is what was done [until 3.0.4](https://github.com/apache/airflow/blob/367d8680af355b492f256ab86aa738f9ee292f2f/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L1039). I am not sure if there was a specific reason for changing this. ### How to reproduce Run a task instance and kill the API server once the task instance begins running. You will notice that the task supervision fails upon the first failed heartbeat. ### Operating System Debian GNU/Linux ### Versions of Apache Airflow Providers _No response_ ### Deployment Astronomer ### Deployment details _No response_ ### Anything else? This also affects Airflow [3.0.5](https://github.com/apache/airflow/blob/c8f60cc2f1cd8064ea1806bde1ed6726191c60d5/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L1041) and [3.0.6](https://github.com/apache/airflow/blob/e965c2e676d85ced65a485d4b2601addc2fd3e97/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L1119). Prior to that, [exc_info=True](https://github.com/apache/airflow/blob/367d8680af355b492f256ab86aa738f9ee292f2f/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L1039) was used. ### Are you willing to submit PR? - [ ] Yes I am willing to submit a PR! ### Code of Conduct - [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
