karenbraganz opened a new issue, #57167:
URL: https://github.com/apache/airflow/issues/57167

   ### Apache Airflow version
   
   3.1.0
   
   ### If "Other Airflow 2/3 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   Supervision of a task instance failed after a single failed heartbeat 
attempt even though the max_failed_heartbeats is set to 3. This happened 
because an exception was raised when the [_handle_heartbeat_failures 
function](https://github.com/apache/airflow/blob/54bd5d8cd9f6f477cc83445737614dec81c4323c/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L1116)
 was called.
   
   During the first failed heartbeat attempt, the _handle_heartbeat_failures 
function logs a message by calling log.warning(), which accepts an exception 
parameter that expects a string type object.  However, in the source code, [an 
exception type object is 
passed](https://github.com/apache/airflow/blob/54bd5d8cd9f6f477cc83445737614dec81c4323c/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L1126)
 instead of a string type object. This results in a TypeError (like below) 
which causes task supervision to fail.
   ```
   TypeError: can only concatenate str (not "RemoteProtocolError") to str
   ```
   
   I have attached the stack trace from the worker logs.
   
   
[worker_typeerror.rtf](https://github.com/user-attachments/files/23104884/worker_typeerror.rtf)
   
   ### What you think should happen instead?
   
   I believe a string should be passed (like below) instead of an exception 
object 
[here](https://github.com/apache/airflow/blob/54bd5d8cd9f6f477cc83445737614dec81c4323c/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L1126).
   ```
   exception=str(exc)
   ```
   
   Alternatively, I think we could pass `exc_info=True` instead of the 
exception parameter. This is what was done [until 
3.0.4](https://github.com/apache/airflow/blob/367d8680af355b492f256ab86aa738f9ee292f2f/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L1039).
 I am not sure if there was a specific reason for changing this.
   
   ### How to reproduce
   
   Run a task instance and kill the API server once the task instance begins 
running.
   
   You will notice that the task supervision fails upon the first failed 
heartbeat.
   
   ### Operating System
   
   Debian GNU/Linux
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Astronomer
   
   ### Deployment details
   
   _No response_
   
   ### Anything else?
   
   This also affects Airflow 
[3.0.5](https://github.com/apache/airflow/blob/c8f60cc2f1cd8064ea1806bde1ed6726191c60d5/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L1041)
 and 
[3.0.6](https://github.com/apache/airflow/blob/e965c2e676d85ced65a485d4b2601addc2fd3e97/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L1119).
 Prior to that, 
[exc_info=True](https://github.com/apache/airflow/blob/367d8680af355b492f256ab86aa738f9ee292f2f/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L1039)
 was used.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to