alex-trm commented on issue #26542: URL: https://github.com/apache/airflow/issues/26542#issuecomment-2605939339
I am able to reliably reproduce the issue using the official Helm chart. I haven't been able to do this before, so I'd like to document the process here:

- Scale down the Redis `statefulset` to 0 replicas
- Wait for the worker to disconnect, then wait about 10 seconds more
- Scale up the Redis `statefulset` to 1 replica

This seems to reliably disconnect the worker from Redis (expected, since the Redis pod is going down) and then reconnect the worker to Redis without the ability to run tasks. It shouldn't take more than 2-3 cycles of the above steps to trigger the catatonic state where the worker says that it's connected, isn't killed by the liveness probe, but can't accept tasks from the queue.

---

I thought that it would be easier to just delete the Redis pod and have it restart without changing the number of replicas, but that doesn't seem to do the trick. The `kombu` link I posted above mentions GC issues, and I'm wondering if keeping Redis down for ~10 seconds is long enough for a few connection retries and *maybe a GC loop* but not so long that the liveness probe kills the worker. Regardless, these steps seem to reliably reproduce the issue.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
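For anyone who wants to script the scale-down / wait / scale-up cycle described in the comment, here is a minimal sketch that drives `kubectl` from Python. Everything specific in it is an assumption and not taken from the comment: the StatefulSet name `airflow-redis`, the namespace `airflow`, the wait durations other than the ~10 seconds, and the `settle_s` pause between cycles. It also does not detect the worker disconnect; `disconnect_wait_s` is a fixed stand-in for watching the worker logs by hand.

```python
import subprocess
import time
from typing import Callable, List

# Assumed names, not given in the comment; check `kubectl get statefulset`.
REDIS_STATEFULSET = "airflow-redis"
NAMESPACE = "airflow"


def scale_command(
    replicas: int,
    statefulset: str = REDIS_STATEFULSET,
    namespace: str = NAMESPACE,
) -> List[str]:
    """Build the kubectl command that scales the Redis StatefulSet."""
    return [
        "kubectl", "scale", "statefulset", statefulset,
        f"--replicas={replicas}",
        "--namespace", namespace,
    ]


def run_kubectl(cmd: List[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


def redis_outage_cycle(
    disconnect_wait_s: float,
    extra_wait_s: float = 10.0,
    run: Callable[[List[str]], None] = run_kubectl,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """One cycle: scale Redis to 0, wait, scale Redis back to 1."""
    run(scale_command(0))
    # Stand-in for "wait for the worker to disconnect" (not detected here).
    sleep(disconnect_wait_s)
    # The extra ~10 seconds mentioned in the comment.
    sleep(extra_wait_s)
    run(scale_command(1))


def reproduce(
    cycles: int = 3,
    disconnect_wait_s: float = 30.0,  # assumption; tune from worker logs
    extra_wait_s: float = 10.0,
    settle_s: float = 60.0,  # assumption; time for Redis to come back
    run: Callable[[List[str]], None] = run_kubectl,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Repeat the outage cycle, pausing between cycles."""
    for _ in range(cycles):
        redis_outage_cycle(disconnect_wait_s, extra_wait_s, run, sleep)
        sleep(settle_s)
```

Calling `reproduce()` with `kubectl` pointed at the right cluster would run three cycles with the assumed timings. Whether the worker actually ends up in the stuck state still has to be checked by hand, for example by queuing a task after the last cycle.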
