jizhuozhi commented on issue #12937: URL: https://github.com/apache/apisix/issues/12937#issuecomment-3797740195
We use AWS EKS (Elastic Kubernetes Service) to run our online services, with a Consul agent (DaemonSet) handling service registration and discovery. Our underlying node groups mix on-demand and spot instances. Since spot instances can be reclaimed at any moment, Consul may not have sufficient time to perform deregistration, which leaves Consul in an inconsistent state. This includes, but is not limited to, returning nodes without services and returning CRITICAL nodes.

Below is the output from our online API verification:

```
curl "http://consul.consul:8500/v1/health/service/inference_server?passing=true"
Returned 138 instances
⚠️ WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
⚠️ WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
⚠️ WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
⚠️ WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
⚠️ WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
⚠️ WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
⚠️ WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
⚠️ WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
⚠️ WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
⚠️ WARNING: ip-[REDACTED]-185.ec2.internal has CRITICAL Serf status, but still returned!
⚠️ WARNING: ip-[REDACTED]-185.ec2.internal has CRITICAL Serf status, but still returned!
⚠️ WARNING: ip-[REDACTED]-185.ec2.internal has CRITICAL Serf status, but still returned!
⚠️ WARNING: ip-[REDACTED]-185.ec2.internal has CRITICAL Serf status, but still returned!
```

To be honest, I'm not entirely sure what's causing this issue (I'm not a Consul expert), but based on what we've observed, I've implemented the following compatibility measure in our internal release:

```lua
local service = node.Service
if not service then
    goto CONTINUE_NODE
end

-- Filter out nodes with a critical serf health check.
-- In our environment, passing=true still returned nodes whose
-- serfHealth (node-level) check was critical, so we check it here.
if node.Checks then
    local has_critical_serf = false
    for _, check in ipairs(node.Checks) do
        if check.CheckID == "serfHealth" and check.Status == "critical" then
            has_critical_serf = true
            log.warn("skip node ", node.Node.Node, " for service ", service_name,
                     " due to critical serf health check")
            break
        end
    end
    if has_critical_serf then
        goto CONTINUE_NODE
    end
end
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
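The filtering step in the Lua snippet above can be tried in isolation, outside APISIX. The following is only a minimal sketch: `has_critical_serf` and `filter_entries` are hypothetical helper names (they are not functions in APISIX or Consul), and the sample table is made-up data shaped like a decoded `/v1/health/service/<name>` response. It assumes each entry carries `Node`, `Service`, and `Checks` fields, as the snippet above does.

```lua
-- Sketch only: helper names and sample data are assumptions for illustration.

-- Returns true if the entry has a serfHealth check in "critical" state.
local function has_critical_serf(entry)
    for _, check in ipairs(entry.Checks or {}) do
        if check.CheckID == "serfHealth" and check.Status == "critical" then
            return true
        end
    end
    return false
end

-- Splits decoded health entries into kept entries and skipped node names.
local function filter_entries(entries)
    local kept, skipped = {}, {}
    for _, entry in ipairs(entries) do
        if not entry.Service then
            -- Entry returned without a service: nothing to route to, drop it.
        elseif has_critical_serf(entry) then
            skipped[#skipped + 1] = entry.Node and entry.Node.Node or "unknown"
        else
            kept[#kept + 1] = entry
        end
    end
    return kept, skipped
end

-- Made-up sample: one healthy node, one critical node, one node without a service.
local sample = {
    { Node = { Node = "node-a" },
      Service = { Address = "10.0.0.1", Port = 8080 },
      Checks = { { CheckID = "serfHealth", Status = "passing" } } },
    { Node = { Node = "node-b" },
      Service = { Address = "10.0.0.2", Port = 8080 },
      Checks = { { CheckID = "serfHealth", Status = "critical" } } },
    { Node = { Node = "node-c" } },
}

local kept, skipped = filter_entries(sample)
print("kept:", #kept, "skipped:", table.concat(skipped, ", "))
```

With this sample, only `node-a` is kept, `node-b` is skipped for its critical `serfHealth` check, and `node-c` is dropped because it has no service. Returning the skipped node names separately keeps logging out of the filter itself, which makes the logic easy to test without a logger.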
