shreemaan-abhishek commented on PR #13579: URL: https://github.com/apache/apisix/pull/13579#issuecomment-4786651822
Thanks for the careful read, you're right that this isn't behavior-neutral. I dug into the why so the trade-off is explicit. The staged connect timeout (100ms first attempt, 800ms on retries) and the lower `max_redirection` / `max_connection_attempts` defaults (2/2) are intentional, not accidental. They come from [api7/lua-resty-redis-cluster#10](https://github.com/api7/lua-resty-redis-cluster/pull/10) ("add healthcheck for failed master nodes and refactor retry strategy"). The goal there was to keep gateway CPU bounded when a redis node goes down: instead of every request blocking for the full `redis_timeout` on a dead node's connect (which piles up under load), the lib fails the connect fast, records the node as unhealthy in a shared dict, short-circuits subsequent requests to it, and a background timer re-probes to rehabilitate. That makes it a fast-fail circuit breaker rather than a timeout-exhaustion path. Two things worth noting on scope: - `send_timeout` / `read_timeout` on these paths still honor `redis_timeout`, and the non-retry paths (`_do_cmd_master`, pipeline) still honor `connect_timeout` fully. So `redis_timeout` is not bypassed wholesale, only the per-attempt connect budget in the retry/discovery loops is fixed. - The breaker needs a `redis_cluster_health` shared dict plus `resty.rediscluster.init()` to start its health-check timer; without them it is inert. I've added both in this PR (dict guarded on the limit-conn/limit-req/limit-count plugins, `init()` started in the privileged agent), which is the same wiring this library has run with in production. On your core concern, high-latency / TLS / cross-AZ clusters where a 100ms first-attempt connect may routinely fail: that's a real limitation, and it's that the staged floor is hardcoded in the library rather than configurable. I'd prefer to fix that upstream in the library (make the connect floor configurable, or honor `connect_timeout` as a floor when it is larger) as a follow-up, rather than reverting the fast-fail design here, since reverting would re-introduce the CPU problem #10 solved. I'll open that on the library and add a note on the connect-timeout semantics to the limit-count docs so the change is disclosed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
