alex-trm commented on issue #26542:
URL: https://github.com/apache/airflow/issues/26542#issuecomment-2605939339

   I am able to reliably reproduce the issue using the official Helm chart.  I 
haven't been able to do this before so I'd like to document the process here:
   
   - Scale down Redis `statefulset` to 0 replicas
   - Wait for the worker to disconnect, then wait about 10 more seconds
   - Scale up Redis `statefulset` to 1 replica
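
   The steps above could be scripted roughly as follows. This is only a sketch, not what I actually ran: the statefulset name `airflow-redis` and the namespace `airflow` are assumptions that depend on your Helm release, and `scale_cmd` / `run_cycle` are hypothetical helper names. Detecting the worker disconnect is left to a callback you supply (for example, a prompt you confirm after watching the worker logs).

```python
import subprocess
import time


def scale_cmd(statefulset, replicas, namespace):
    """Build the kubectl command that scales a statefulset to `replicas`."""
    return [
        "kubectl", "scale", "statefulset", statefulset,
        f"--replicas={replicas}", "--namespace", namespace,
    ]


def run_cycle(statefulset, namespace, wait_for_disconnect,
              extra_down_seconds=10, run=subprocess.run, sleep=time.sleep):
    """Run one scale-down / wait / scale-up cycle against Redis.

    `wait_for_disconnect` is a caller-supplied callable that returns once
    the worker has lost its Redis connection (assumption: you decide how
    to detect that, e.g. by watching the worker logs).
    """
    # Step 1: scale the Redis statefulset down to 0 replicas.
    run(scale_cmd(statefulset, 0, namespace), check=True)
    # Step 2: wait for the worker to disconnect, then about 10 more seconds.
    wait_for_disconnect()
    sleep(extra_down_seconds)
    # Step 3: scale the Redis statefulset back up to 1 replica.
    run(scale_cmd(statefulset, 1, namespace), check=True)
```

   To use it, call something like `run_cycle("airflow-redis", "airflow", wait_for_disconnect=lambda: input("Press Enter once the worker has disconnected: "))` and repeat for a few cycles, checking after each one whether the worker still picks up tasks.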
   
   This seems to reliably disconnect the worker from Redis (expected, since the 
Redis pod is going down) and then reconnect the worker to Redis without the 
ability to run tasks.  It shouldn't take more than 2-3 cycles of the above steps 
to trigger the catatonic state where the worker says that it's connected, isn't 
killed by the liveness probe, but can't accept tasks from the queue.
   
   ---
   
   I thought that it would be easier to just delete the Redis pod and have it 
restart without changing the number of replicas, but that doesn't seem to do the 
trick.  The `kombu` link I posted above mentions GC issues, and I'm wondering if 
keeping Redis down for ~10 seconds is long enough for a few connection retries 
and *maybe a GC loop* but not so long that the liveness probe kills the worker.  
Regardless, these steps seem to reliably reproduce the issue.

