arjunbhut commented on issue #2749:
URL:
https://github.com/apache/apisix-ingress-controller/issues/2749#issuecomment-4421267853
Hi @Baoyuantop, I'd like to add a second trigger to this issue. We're
hitting the same `*_conf_version must be greater than or equal to (X)`
rejections, but our setup never touches the Admin API and never mixes
management methods, and the issue still recurs every sync cycle. I believe
there's a structural multi-replica drift in API-driven standalone mode that the
current diagnosis doesn't cover.
## Setup
| Component | Version |
|---|---|
| Helm chart | `apisix/apisix:2.14.0` |
| APISIX | `apache/apisix:3.16.0-ubuntu` |
| Ingress controller | `apache/apisix-ingress-controller:2.0.1` |
| ADC sidecar | `ghcr.io/api7/adc:0.25.0` (overridden from chart default `0.23.1`) |
| Replicas | **10** |
| Provider | `apisix-standalone` (per official install guide) |
Deployment values follow the docs verbatim:
```yaml
apisix.deployment.role: traditional
apisix.deployment.role_traditional.config_provider: yaml
etcd.enabled: false
ingress-controller.config.provider.type: apisix-standalone
```
## What we *don't* do
- No `curl` against the Admin API (port 9180).
- No `kubectl exec` to write config files.
- No APISIX Dashboard.
- Config managed exclusively via `Gateway`, `HTTPRoute`, and `GatewayProxy`
CRDs (Gateway API + `apisix.apache.org/v1alpha1`).
- Fresh cluster: `helm uninstall` + reinstall, fresh CRD apply, fresh
secrets — same drift recurs within minutes.
## Symptom
ADC runs a sync cycle roughly every minute. Each cycle, **a different subset of
pods rejects with 400**, citing some `*_conf_version must be >= (epoch_ms)`.
Sample over a 5-minute window (output from ADC 0.25.0's per-endpoint status,
which is much appreciated, by the way):
```
cycle 1: success=9, failed=1 pod 10.53.4.224 ssls_conf_version must be >= 1778497564298
cycle 2: success=8, failed=2 pod 10.53.4.224 ssls_conf_version
                             pod 10.53.4.221 upstreams_conf_version
cycle 3: success=9, failed=1 pod 10.53.3.55 services_conf_version
cycle 4: success=10 (one good sync)
cycle 5: success=8, failed=2 pod 10.53.4.222, 10.53.4.220 ssls_conf_version
```
Note that **the stuck pod rotates**. It's not stuck-forever-on-one-pod
(which would match the user-triggered single-spike scenario in this issue).
External effect: TLS handshake succeeds for traffic landing on the "good" pods,
fails (`tlsv1 alert internal_error` because no SSL loaded) on the "stuck" pods.
Hit rate fluctuates between ~30% and ~85% depending on which pods are currently
behind.
## Why "don't mix management" doesn't apply here
The diagnosis above suggests the user pushed a high version via Admin API,
so ADC's lower number gets rejected. That's a single-trigger scenario,
recoverable by restart-both-sides. We've:
1. Confirmed no Admin API access from our side (verified via apisix pod
access logs — only `axios/1.13.2` UA, the controller's ADC client).
2. Done the recommended `delete apisix pod + restart ingress-controller`
recovery. Works for a sync cycle or two, then drift returns.
3. Fully torn down: `helm uninstall apisix`, delete all CRs, fresh `helm
install`. Drift still recurs within minutes.
What we're seeing is **organic, structural drift** between N independent
in-memory version counters and ADC's local CacheKey, with no user-induced
trigger.
## Hypothesis
`*_conf_version` is per-pod in-memory state. ADC sends one canonical version
to N pods in parallel. Any time *one* pod's accept lands a moment later, or a
retry pushes a slightly higher version to a single pod, that pod is now
"ahead" of ADC's local CacheKey *and* ahead of the other N-1 pods. On the
next periodic sync, ADC's version is below the ahead-pod's required minimum →
400. ADC may not be updating its CacheKey from the maximum observed across
pods, only from its local notion.
In API-driven standalone with multiple replicas, this divergence appears
unbounded: each cycle can produce a new "ahead" pod (because every cycle has
its own micro-race), so the set of stuck pods rotates over time.
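To make the hypothesis concrete, here is a toy model of it in Python. It is not APISIX or ADC code: the `Pod` class, the `sync` helper, and the version numbers are all hypothetical, and it assumes the sender reuses its cached version when nothing has changed. It only shows how a single pod that accepted a higher version starts rejecting the canonical one.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    """Toy stand-in for one replica holding an in-memory version counter."""
    name: str
    conf_version: int = 0

    def put(self, version: int) -> int:
        """Return 204 if accepted, 400 if the version is below this pod's minimum."""
        if version < self.conf_version:
            return 400
        self.conf_version = version
        return 204


def sync(pods, version):
    """Push one version to every pod and collect the per-pod status codes."""
    return {pod.name: pod.put(version) for pod in pods}


pods = [Pod(f"pod-{i}") for i in range(3)]

# Cycle 1: every pod accepts the canonical version.
cached = 1000  # arbitrary illustrative value, not a real epoch_ms
cycle1 = sync(pods, cached)

# A retry lands on one pod with a slightly higher version, while the
# sender's cached version stays where it was (assumption of this model).
pods[1].put(1005)

# Cycle 2: the cached version is resent; pod-1 is now ahead, so it answers 400.
cycle2 = sync(pods, cached)
print(cycle1)
print(cycle2)
```

In this model the rejecting pod stays rejecting until the sender's version catches up, and a fresh race on another pod in a later cycle would move the failure elsewhere, which is the rotation described above.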
## Reproducible?
Yes, in any cluster with `replicaCount: 10` + `apisix-standalone` provider +
only-CRD management. Happy to provide:
- Helm values used.
- ADC controller logs across many cycles showing the rotation.
- Full per-pod 204/400 distribution.
## Questions
1. Is there a documented multi-replica-safe configuration for
`apisix-standalone`, or is the deployment mode currently single-replica by
design intent?
2. Would it be feasible for ADC to update its local CacheKey from
`max(local, last_observed_per_pod)` rather than only from its own writes? That
would eliminate the drift after one full cycle.
3. Is there a `disable_conf_version_check` or similar option for
environments where the controller is the only writer?
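For question 2, this is roughly what I have in mind, as a minimal Python sketch. I don't know how ADC stores its cache internally, so the `reconcile` and `observed_minimum` helpers and the dict-shaped cache are hypothetical, and the regex assumes the rejection wording quoted in this thread.

```python
import re

# Assumed message shape, based only on the rejection text quoted in this thread.
REJECTION = re.compile(
    r"(\w+_conf_version) must be greater than or equal to \(?(\d+)\)?"
)


def observed_minimum(message):
    """Extract (key, minimum) from a rejection message, or None if it doesn't match."""
    match = REJECTION.search(message)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def reconcile(local, rejections):
    """Return a copy of the local versions raised to max(local, observed per pod)."""
    updated = dict(local)
    for message in rejections:
        parsed = observed_minimum(message)
        if parsed is None:
            continue  # ignore errors that are not version rejections
        key, minimum = parsed
        updated[key] = max(updated.get(key, 0), minimum)
    return updated


local = {"ssls_conf_version": 1778497564000}
rejections = [
    "ssls_conf_version must be greater than or equal to (1778497564298)",
]
reconciled = reconcile(local, rejections)
print(reconciled)
```

Taking the maximum means the next push would be at least as high as the furthest-ahead pod's minimum, so every pod should accept it.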
If the answer to (1) is "single replica only," that should probably be
more prominent in the [Install with
Helm](https://apisix.apache.org/docs/ingress-controller/install/) guide — the
canonical example does not call this out, and 10 replicas seemed like a
reasonable default for an ingress.
Happy to file as a separate `bug:` if the maintainers prefer keeping this
issue scoped to the original (Admin API mixed-management) trigger.
---
*Comment drafted with Claude Code assistance, reviewed and posted by
@arjunbhut.*