jizhuozhi commented on issue #12937:
URL: https://github.com/apache/apisix/issues/12937#issuecomment-3797740195

   We use AWS EKS (Elastic Kubernetes Service) to run our online services, with a Consul Agent DaemonSet handling service registration and discovery. Our underlying node groups are a mix of on-demand and spot instances. Since spot instances can be reclaimed at any moment, the Consul agent may not have enough time to deregister its services, leaving Consul in an inconsistent state. This includes, but is not limited to, returning nodes without services and returning nodes whose Serf health check is CRITICAL. The following is the output from our online API verification:
   
   ```
   curl "http://consul.consul:8500/v1/health/service/inference_server?passing=true"
   Returned 138 instances
   ⚠️  WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
   ⚠️  WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
   ⚠️  WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
   ⚠️  WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
   ⚠️  WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
   ⚠️  WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
   ⚠️  WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
   ⚠️  WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
   ⚠️  WARNING: ip-[REDACTED]-119.ec2.internal has CRITICAL Serf status, but still returned!
   ⚠️  WARNING: ip-[REDACTED]-185.ec2.internal has CRITICAL Serf status, but still returned!
   ⚠️  WARNING: ip-[REDACTED]-185.ec2.internal has CRITICAL Serf status, but still returned!
   ⚠️  WARNING: ip-[REDACTED]-185.ec2.internal has CRITICAL Serf status, but still returned!
   ⚠️  WARNING: ip-[REDACTED]-185.ec2.internal has CRITICAL Serf status, but still returned!
   ```
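   For context, that verification can be reproduced offline against the JSON shape returned by `/v1/health/service/<service>`: each entry carries `Node`, `Service`, and `Checks`. Below is a minimal, hypothetical Python sketch (the function name and sample data are mine, not from APISIX or Consul) that flags entries whose node-level `serfHealth` check is `critical` even though `passing=true` was requested:
   
   ```python
   # Sketch: flag Consul health entries whose node-level "serfHealth"
   # check is critical. Entries mimic the /v1/health/service response.
   def find_critical_serf_nodes(entries):
       flagged = []
       for entry in entries:
           for check in entry.get("Checks", []):
               if check.get("CheckID") == "serfHealth" and check.get("Status") == "critical":
                   flagged.append(entry["Node"]["Node"])
                   break
       return flagged
   
   # Sample payload showing the observed inconsistency: the service
   # check passes, but the node's serfHealth check is critical.
   sample = [
       {
           "Node": {"Node": "ip-10-0-0-119.ec2.internal"},
           "Service": {"ID": "inference_server-1"},
           "Checks": [
               {"CheckID": "serfHealth", "Status": "critical"},
               {"CheckID": "service:inference_server-1", "Status": "passing"},
           ],
       },
       {
           "Node": {"Node": "ip-10-0-0-7.ec2.internal"},
           "Service": {"ID": "inference_server-2"},
           "Checks": [
               {"CheckID": "serfHealth", "Status": "passing"},
               {"CheckID": "service:inference_server-2", "Status": "passing"},
           ],
       },
   ]
   
   print(find_critical_serf_nodes(sample))  # ['ip-10-0-0-119.ec2.internal']
   ```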
   
   To be honest, I'm not entirely sure what's causing this issue (I'm not a Consul expert), but based on what we've observed, I've implemented a compatibility workaround in our internal release:
   
   ```lua
   local service = node.Service
   if not service then
       goto CONTINUE_NODE
   end
   
   -- Filter out nodes with a critical serf health check.
   -- Consul's passing=true only filters service checks, not node checks.
   if node.Checks then
       local has_critical_serf = false
       for _, check in ipairs(node.Checks) do
           if check.CheckID == "serfHealth" and check.Status == "critical" then
               has_critical_serf = true
               log.warn("skip node ", node.Node.Node,
                        " for service ", service_name,
                        " due to critical serf health check")
               break
           end
       end
       if has_critical_serf then
           goto CONTINUE_NODE
       end
   end
   ```
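   The filtering rule above can also be exercised in isolation. Here is a hedged Python equivalent (names are illustrative, not APISIX code): it drops entries that have no `Service` or whose node-level `serfHealth` check is critical, and keeps the rest:
   
   ```python
   # Sketch of the same rule as the Lua snippet: skip entries with no
   # Service, skip entries whose serfHealth check is critical, keep the rest.
   def filter_critical_serf(entries):
       kept = []
       for entry in entries:
           if not entry.get("Service"):
               continue  # node returned without a service; skip it
           checks = entry.get("Checks") or []
           if any(c.get("CheckID") == "serfHealth" and c.get("Status") == "critical"
                  for c in checks):
               continue  # e.g. spot node reclaimed before deregistration
           kept.append(entry)
       return kept
   
   entries = [
       {"Service": {"ID": "a"},
        "Checks": [{"CheckID": "serfHealth", "Status": "critical"}]},
       {"Service": {"ID": "b"},
        "Checks": [{"CheckID": "serfHealth", "Status": "passing"}]},
       {"Service": None, "Checks": []},  # node without a service
   ]
   
   print([e["Service"]["ID"] for e in filter_critical_serf(entries)])  # ['b']
   ```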

