dstandish commented on issue #26542:
URL: https://github.com/apache/airflow/issues/26542#issuecomment-1904486710

   We were able to improve the behavior with these settings:
   ```
   AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_TIMEOUT=30
   AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_CONNECT_TIMEOUT=5
   AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_KEEPALIVE=True
   AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__RETRY_ON_TIMEOUT=True
   ```
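
   As a rough illustration of how these variable names are read, the sketch below splits each name using Airflow's `AIRFLOW__{SECTION}__{KEY}` convention and collects the keys that fall in the `celery_broker_transport_options` section. The helpers `split_airflow_env` and `to_transport_options` are hypothetical names used only for this example; they are not part of Airflow. The values are left as strings here, on the assumption that Airflow does its own type conversion when it loads the section.

```python
# Settings from the comment above, written as a plain mapping for illustration.
SETTINGS = {
    "AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_TIMEOUT": "30",
    "AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_CONNECT_TIMEOUT": "5",
    "AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_KEEPALIVE": "True",
    "AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__RETRY_ON_TIMEOUT": "True",
}


def split_airflow_env(name):
    """Hypothetical helper: split AIRFLOW__{SECTION}__{KEY} into (section, key)."""
    prefix = "AIRFLOW__"
    if not name.startswith(prefix):
        raise ValueError(f"not an Airflow config variable: {name}")
    # The first double underscore after the prefix separates section from key.
    section, _, key = name[len(prefix):].partition("__")
    if not section or not key:
        raise ValueError(f"expected AIRFLOW__SECTION__KEY, got: {name}")
    return section.lower(), key.lower()


def to_transport_options(settings):
    """Hypothetical helper: collect keys under celery_broker_transport_options."""
    options = {}
    for name, value in settings.items():
        section, key = split_airflow_env(name)
        if section == "celery_broker_transport_options":
            options[key] = value  # kept as a string; no type coercion here
    return options


if __name__ == "__main__":
    for key, value in to_transport_options(SETTINGS).items():
        print(f"{key} = {value}")
```

   Read this way, the four variables correspond to the `socket_timeout`, `socket_connect_timeout`, `socket_keepalive` and `retry_on_timeout` keys of that section.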
   
   In our repro (force-killing the Redis pod), the Celery worker would get 
stuck waiting forever for a response when it tried to send its heartbeat to 
the failed Redis pod. I believe adding the socket timeout allows the 
connection to close in this scenario, which ultimately allows the container to 
be restarted.
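
   The mechanism can be sketched with the Python standard library alone. This is not how the worker or the Redis client is implemented; it only shows that a read from a peer that never answers blocks until the socket's timeout fires, after which the caller regains control. The silent local server stands in for the unreachable Redis pod, and the function names are made up for this example.

```python
import socket
import threading


def start_silent_server():
    """Listen on a free local port, accept one connection, and never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    held = []

    def accept_and_hold():
        conn, _ = server.accept()
        held.append(conn)  # keep the connection open but silent

    threading.Thread(target=accept_and_hold, daemon=True).start()
    return server, held


def read_with_timeout(port, timeout):
    """Send a request and wait for a reply, giving up after `timeout` seconds."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as client:
        client.sendall(b"PING\r\n")
        try:
            client.recv(1024)  # without a timeout this call would block forever
            return "reply"
        except socket.timeout:
            return "timed out"


if __name__ == "__main__":
    server, held = start_silent_server()
    port = server.getsockname()[1]
    print(read_with_timeout(port, timeout=0.5))
    server.close()
```

   Without the `timeout` argument the `recv` call has no deadline, which matches the stuck worker described above.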
   
   There's also a similar issue reported in Celery that may be resolved in 
5.4.0 (see 
[comment](https://github.com/celery/celery/discussions/7276#discussioncomment-8160246)), 
but I have not tested it.

