vortegatorres opened a new issue, #2708:
URL: https://github.com/apache/apisix-ingress-controller/issues/2708

   ### Current Behavior
   
   After scaling a backend Deployment up and then back down, APISIX keeps the removed Pod IP in its upstream/endpoints set and continues to route traffic to it.
   
   **Observed timeline (UTC, sanitized IPs):**
   - ~11:20 UTC: scaled backend from 3 → 4 replicas for ~1 hour (new pod IP = 
X).
   - ~12:20 UTC: scaled back down 4 → 3 (pod with IP X removed).
   - After the scale-down, APISIX still routed to X:8000 (stale) and produced spikes of:
     - `111: Connection refused`
   - Many traces/logs during the error window showed both the stale and a valid upstream on the same request (the second attempt succeeded after a retry), e.g.:
     - `upstream_addr`: "X:8000, Y:8000"
   - Hours later (~04:00 UTC), after restarting the APISIX deployment, the stale IP stopped appearing in `upstream_addr` and the errors stopped.
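
   One way to confirm whether the removed IP is still held in APISIX's in-memory upstream configuration is to dump the upstreams from the Control API and look for it. This is only a minimal sketch, assuming the Control API is enabled on its default `127.0.0.1:9090` address and has been made reachable (e.g. via `kubectl port-forward` or `kubectl exec` into the APISIX pod); the IP below is a placeholder, not the real Pod IP:
   ```python
# Hedged sketch: check whether a removed Pod IP still appears in APISIX's
# upstream dump. Control API address and the IP are assumptions/placeholders.
import json

import requests

CONTROL_API = "http://127.0.0.1:9090"  # default Control API address (assumption)
STALE_IP = "10.0.0.42"                 # the removed Pod IP "X" (placeholder)

resp = requests.get(f"{CONTROL_API}/v1/upstreams", timeout=5)
resp.raise_for_status()

# Search the raw dump rather than assuming a particular response shape.
dump = json.dumps(resp.json(), indent=2)
if STALE_IP in dump:
    print(f"{STALE_IP} still appears in at least one APISIX upstream")
else:
    print(f"{STALE_IP} is no longer present in any upstream")
   ```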
   
   ### Expected Behavior
   
   After the Pod is deleted and Kubernetes Endpoints/EndpointSlices are 
updated, APISIX should stop routing to the removed Pod IP immediately (or 
within the expected controller sync/update window), and should not keep stale 
upstream targets.
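
   To bound that sync/update window in practice, one could record when Kubernetes itself drops the Pod IP from the Service's EndpointSlices and compare the timestamp with the access logs. A minimal sketch, assuming the official `kubernetes` Python client and a local kubeconfig; the namespace and Service name are placeholders:
   ```python
# Hedged sketch: log every EndpointSlice update for the backend Service with a
# timestamp, so the moment the removed Pod IP disappears from Kubernetes can be
# compared against what APISIX keeps routing to. Namespace/Service are placeholders.
from datetime import datetime, timezone

from kubernetes import client, config, watch

NAMESPACE = "default"       # placeholder
SERVICE = "my-backend"      # placeholder

config.load_kube_config()   # or config.load_incluster_config() inside the cluster
discovery = client.DiscoveryV1Api()

w = watch.Watch()
for event in w.stream(
    discovery.list_namespaced_endpoint_slice,
    namespace=NAMESPACE,
    label_selector=f"kubernetes.io/service-name={SERVICE}",
):
    es = event["object"]
    addresses = [a for ep in (es.endpoints or []) for a in (ep.addresses or [])]
    print(f"{datetime.now(timezone.utc).isoformat()} {event['type']} "
          f"{es.metadata.name}: {sorted(addresses)}")
   ```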
   
   ### Error Logs
   
   For some requests, we first see the `111: Connection refused` error:
   ```
   2026-01-29 04:05:10.638 error 2026/01/29 03:05:10 [error] 49#49: *1392456 connect() failed (111: Connection refused) while connecting to upstream, client: XXXXXX, server: _, request: "POST /api HTTP/1.1", upstream: "http://YYYYYY:8000/api", host: "url"
   ```
   And then the successfully retried request, with both IPs in `upstream_addr` (stale IP + correct IP):
   ```
   2026-01-29 04:05:11.430 {
     "ts": "2026-01-29T03:05:11+00:00",
     "service": "apisix",
     "resp_body_size": "0",
     "host": "url",
     "address": "XXXXX",
     "request_length": "595",
     "method": "POST",
     "uri": "/api",
     "status": "204",
     "user_agent": "Go-http-client/2.0",
     "resp_time": "0.512",
     "upstream_addr": "X:8000, Y:8000",
     "upstream_status": "502, 204",
     "traceparent": "00-0bc72f67dd2ee53ce6d7f03e4a4eb7d6-3a5171c169e8eb46-01",
     "trace_id": "0bc72f67dd2ee53ce6d7f03e4a4eb7d6",
     "span_id": "3a5171c169e8eb46",
     "org_slug": "",
     "matched_uri": ""
   } 
   ```
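
   To quantify how often the removed IP was still being tried, a minimal sketch that scans JSON access-log lines like the one above and counts requests whose `upstream_addr` contains the stale IP (the log path and IP are placeholders, and the JSON log format is assumed to match the example):
   ```python
# Hedged sketch: count access-log entries that still reference the removed
# Pod IP in upstream_addr. Assumes JSON-formatted log lines like the example
# above; the file path and IP are placeholders.
import json

LOG_PATH = "access.log"     # placeholder
STALE_IP = "10.0.0.42"      # the removed Pod IP "X" (placeholder)

stale = total = 0
with open(LOG_PATH) as f:
    for line in f:
        # Lines may carry a collector timestamp prefix before the JSON object.
        start = line.find("{")
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        total += 1
        if STALE_IP in entry.get("upstream_addr", ""):
            stale += 1

print(f"{stale} of {total} requests still hit {STALE_IP}")
   ```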
   
   ### Steps to Reproduce
   
   The error is not deterministic; it does not always happen, but the steps below are the closest way to reproduce it (a scripted version is sketched after the list):
   - Deploy APISIX using the APISIX Helm chart v2.12.6 in standalone mode (no 
etcd).
   - Deploy a backend service behind APISIX (any Kubernetes Service/Deployment 
that APISIX routes to).
   - Run steady traffic through APISIX to that backend.
   - Temporarily scale the backend Deployment up (e.g., from 3 replicas to 4) 
and keep it running for ~1 hour.
   - Scale the backend back down to the original replica count (e.g., 4 → 3), 
so one Pod is terminated and its IP is removed.
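
   A scripted version of these steps, as a minimal sketch using the official `kubernetes` Python client. The Deployment name, namespace, and the one-hour soak are placeholders/assumptions rather than exact values from the incident, and steady traffic against the route is assumed to be generated separately:
   ```python
# Hedged sketch: automate the scale-up / soak / scale-down sequence above.
# Deployment, namespace, and timings are placeholders.
import time

from kubernetes import client, config

NAMESPACE = "default"            # placeholder
DEPLOYMENT = "my-backend"        # placeholder
BASE_REPLICAS = 3
SCALED_REPLICAS = 4
SOAK_SECONDS = 3600              # roughly the ~1 hour window from the report

config.load_kube_config()
apps = client.AppsV1Api()

def scale(replicas: int) -> None:
    apps.patch_namespaced_deployment_scale(
        name=DEPLOYMENT,
        namespace=NAMESPACE,
        body={"spec": {"replicas": replicas}},
    )
    print(f"scaled {DEPLOYMENT} to {replicas} replicas")

scale(SCALED_REPLICAS)    # 3 -> 4: a new Pod (IP X) joins the Service
time.sleep(SOAK_SECONDS)  # keep steady traffic flowing through APISIX meanwhile
scale(BASE_REPLICAS)      # 4 -> 3: the Pod with IP X is terminated

# After the scale-down, watch APISIX access logs / upstream_addr for the
# removed Pod IP and for 502s followed by retries, as described above.
   ```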
   
   ### Environment
   
   - APISIX Helm chart 2.12.6.
   - APISIX Ingress controller version: 2.0.1
   - Kubernetes cluster version: 1.32
   - OS version if running APISIX Ingress controller in a bare-metal 
environment: Linux x86_64 GNU/Linux
   - ACD: 0.23.1 

