vortegatorres opened a new issue, #2708:
URL: https://github.com/apache/apisix-ingress-controller/issues/2708
### Current Behavior
After scaling a backend Deployment up and then back down, APISIX keeps the
removed Pod IP in its upstream node set and continues trying to route
traffic to it.
**Observed timeline (UTC, sanitized IPs):**
- ~11:20 UTC: scaled backend from 3 → 4 replicas for ~1 hour (new pod IP =
X).
- ~12:20 UTC: scaled back down 4 → 3 (pod with IP X removed).
- After the scale-down, APISIX still routed to X:8000 (stale) and produced
spikes of `(111: Connection refused)` errors.
- Many traces/logs during the error window showed both the stale and a valid
upstream on the same request, i.e. the request succeeded only after an
internal retry, e.g.:
  - `upstream_addr`: `"X:8000, Y:8000"`
- Hours later (~04:00 UTC), after restarting the APISIX deployment, the
stale IP stopped appearing in `upstream_addr` and the errors stopped (see
the diagnostic sketch below).
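To confirm a stale node without restarting, one diagnostic is to diff the Pod IPs Kubernetes currently publishes for the Service against the upstream nodes the running data plane holds. Below is a minimal Go sketch of that check; the `default` namespace, the `my-backend` service name, the control-API address `127.0.0.1:9090`, and the array-of-`{"value": {"nodes": [...]}}` response shape of `/v1/upstreams` are all assumptions that may need adjusting for your deployment and APISIX version.

```go
// stalecheck.go: compare the Pod IPs Kubernetes currently publishes for a
// Service with the upstream nodes APISIX reports on its control API.
// Namespace, service name, and control-API address are placeholders.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Addresses Kubernetes still considers valid endpoints for the Service.
	slices, err := cs.DiscoveryV1().EndpointSlices("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "kubernetes.io/service-name=my-backend"})
	if err != nil {
		panic(err)
	}
	valid := map[string]bool{}
	for _, s := range slices.Items {
		for _, ep := range s.Endpoints {
			for _, addr := range ep.Addresses {
				valid[addr] = true
			}
		}
	}

	// Upstream nodes as the running data plane sees them (control API);
	// the response shape assumed here may differ between APISIX versions.
	resp, err := http.Get("http://127.0.0.1:9090/v1/upstreams")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	var upstreams []struct {
		Value struct {
			Nodes []struct {
				Host string `json:"host"`
			} `json:"nodes"`
		} `json:"value"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&upstreams); err != nil {
		panic(err)
	}
	for _, u := range upstreams {
		for _, n := range u.Value.Nodes {
			if !valid[n.Host] {
				fmt.Printf("stale upstream node: %s\n", n.Host)
			}
		}
	}
}
```

In the timeline above, running this during the error window should have reported X as stale while the EndpointSlices contained only the remaining Pod IPs.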
### Expected Behavior
After the Pod is deleted and Kubernetes Endpoints/EndpointSlices are
updated, APISIX should stop routing to the removed Pod IP immediately (or
within the expected controller sync/update window), and should not keep stale
upstream targets.
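A quick way to verify that the Kubernetes side of this contract holds is to watch the same EndpointSlice events the controller consumes. Here is a minimal client-go sketch, again assuming a `default` namespace and a hypothetical `my-backend` Service; if the removed Pod IP disappears from the slice immediately but APISIX keeps routing to it, the staleness is on the controller/gateway side.

```go
// slicewatch.go: watch EndpointSlice updates for one Service; the namespace
// and service name below are hypothetical placeholders.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// EndpointSlices carry their Service name in a well-known label.
	w, err := cs.DiscoveryV1().EndpointSlices("default").Watch(context.TODO(),
		metav1.ListOptions{LabelSelector: "kubernetes.io/service-name=my-backend"})
	if err != nil {
		panic(err)
	}
	// Every event here should translate into an upstream update in APISIX;
	// a Pod IP that has left the slice should stop receiving traffic.
	for ev := range w.ResultChan() {
		fmt.Printf("%s %+v\n", ev.Type, ev.Object)
	}
}
```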
### Error Logs
For some requests, the logs first show the `(111: Connection refused)` error:
```
2026-01-29 04:05:10.638 error 2026/01/29 03:05:10 [error] 49#49: *1392456
connect() failed (111: Connection refused) while connecting to upstream,
client: XXXXXX, server: _, request: "POST /api HTTP/1.1", upstream:
"http://YYYYYY:8000/api", host: "url"
```
And then the successfully retried request, with both IPs in
`upstream_addr` (stale IP + correct IP):
```
2026-01-29 04:05:11.430 {
"ts": "2026-01-29T03:05:11+00:00",
"service": "apisix",
"resp_body_size": "0",
"host": "url",
"address": "XXXXX",
"request_length": "595",
"method": "POST",
"uri": "/api",
"status": "204",
"user_agent": "Go-http-client/2.0",
"resp_time": "0.512",
"upstream_addr": "X:8000, Y:8000",
"upstream_status": "502, 204",
"traceparent": "00-0bc72f67dd2ee53ce6d7f03e4a4eb7d6-3a5171c169e8eb46-01",
"trace_id": "0bc72f67dd2ee53ce6d7f03e4a4eb7d6",
"span_id": "3a5171c169e8eb46",
"org_slug": "",
"matched_uri": ""
}
```
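To quantify how often the stale node was hit, here is a small sketch that scans access-log lines like the one above from stdin and prints every request that succeeded only after an internal retry. The field names are taken from the log sample; the timestamp-prefix handling and the `502, ...` status pattern are assumptions based on it.

```go
// retryscan.go: scan access-log lines from stdin (one JSON object per line,
// optionally prefixed by a timestamp as in the sample above) and print each
// request that succeeded only after an internal retry.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

type entry struct {
	Ts             string `json:"ts"`
	UpstreamAddr   string `json:"upstream_addr"`
	UpstreamStatus string `json:"upstream_status"`
}

func main() {
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		line := sc.Text()
		i := strings.IndexByte(line, '{')
		if i < 0 {
			continue
		}
		var e entry
		if json.Unmarshal([]byte(line[i:]), &e) != nil {
			continue // skip non-JSON lines such as nginx error-log entries
		}
		// "X:8000, Y:8000" with "502, 204" means the first (stale) node
		// failed and the request was retried against a valid one.
		addrs := strings.Split(e.UpstreamAddr, ", ")
		if len(addrs) > 1 && strings.HasPrefix(e.UpstreamStatus, "502") {
			fmt.Printf("%s retried, first tried %s\n", e.Ts, addrs[0])
		}
	}
}
```

Something like `kubectl logs <apisix-pod> | go run retryscan.go` (pod name is a placeholder) gives a rough count of affected requests.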
### Steps to Reproduce
This error is not deterministic and doesn't always happen, but this is the
closest way to reproduce it:
- Deploy APISIX using the APISIX Helm chart v2.12.6 in standalone mode (no
etcd).
- Deploy a backend service behind APISIX (any Kubernetes Service/Deployment
that APISIX routes to).
- Run steady traffic through APISIX to that backend (a minimal
load-generator sketch follows this list).
- Temporarily scale the backend Deployment up (e.g., from 3 replicas to 4)
and keep it running for ~1 hour.
- Scale the backend back down to the original replica count (e.g., 4 → 3),
so one Pod is terminated and its IP is removed.
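For the steady-traffic step, a loop like the following is enough; the gateway URL and request shape are placeholders. It logs every non-2xx response, so a stale-upstream window shows up as a burst of errors or retry-inflated latency.

```go
// load.go: steady request loop against the route served by APISIX; logs
// non-2xx responses. The URL and body below are hypothetical placeholders.
package main

import (
	"log"
	"net/http"
	"strings"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	for {
		resp, err := client.Post("http://apisix-gateway.example/api",
			"application/json", strings.NewReader(`{}`))
		if err != nil {
			log.Printf("request error: %v", err)
		} else {
			if resp.StatusCode >= 300 {
				log.Printf("non-2xx response: %d", resp.StatusCode)
			}
			resp.Body.Close()
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```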
### Environment
- APISIX Helm chart 2.12.6.
- APISIX Ingress controller version: 2.0.1
- Kubernetes cluster version: 1.32
- OS version if running APISIX Ingress controller in a bare-metal
environment: Linux x86_64 GNU/Linux
- ACD: 0.23.1