jizhuozhi opened a new issue, #12846:
URL: https://github.com/apache/apisix/issues/12846
### Current Behavior
While integrating APISIX as an ingress / access gateway for large downstream
services, I encountered a discovery synchronization issue that appears to be
structural rather than configuration-related.
During routine testing last night, APISIX failed to update its service
discovery state and logged the following error:
```
2025/12/26 12:32:39 [error] 49#49: *734 [lua] init.lua:583: post_event
failure with discovery_consul_update_all_services, update all services error:
failed to publish event: payload exceeds the limitation (65536), context:
ngx.timer
```
This happened during a full discovery update from the registry (Consul in
this case), triggered by a timer.
### Expected Behavior
APISIX currently synchronizes discovery state from the control plane to
workers by broadcasting **full snapshots** (services + endpoints) via worker
events.
Worker events have a **default payload size limit of 64KB**, which makes
this model inherently non-scalable:
* Even **without endpoint metadata**, a single service with thousands of
IP:port pairs can exceed this limit.
* Adding metadata (e.g., for subset routing, lanes, versions) increases
payload size and accelerates hitting the ceiling.
* Large downstream services are increasingly common when APISIX is used as a
centralized ingress / access platform.
This limitation leads to:
* inconsistent worker state
* stale endpoints
* broken routing or subset selection
Importantly, this applies to any registry backend using the same broadcast
model, not only Consul.
---
### Why this is a structural issue
The current design assumes that discovery state can be delivered as a single
event payload. In large-scale environments:
* Endpoint count grows with traffic and business scale
* Event payload size grows linearly with endpoint count
* Hitting the 64KB limit is therefore only a matter of scale
As a result, this failure mode will occur in sufficiently large deployments.
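To make the linear growth concrete, here is a rough sketch that estimates how many endpoints fit into one event. It assumes the snapshot is serialized as compact JSON and that each endpoint carries `host`, `port`, `weight`, and optional `metadata` fields; both the encoding and the field names are illustrative assumptions, not APISIX's actual wire format. Only the 65536-byte limit comes from the error message above.

```python
import json

EVENT_PAYLOAD_LIMIT = 65536  # bytes, taken from the error message above


def make_node(i, with_meta):
    """Build one hypothetical endpoint entry (field names are assumptions)."""
    host = f"10.{(i >> 16) & 255}.{(i >> 8) & 255}.{i & 255}"
    node = {"host": host, "port": 8080, "weight": 1}
    if with_meta:
        node["metadata"] = {"version": "v1", "region": "us-east-1"}
    return node


def snapshot_size(count, with_meta):
    """Serialized size in bytes of a one-service snapshot with `count` endpoints."""
    nodes = [make_node(i, with_meta) for i in range(count)]
    snapshot = {"billing.ad.oversea.cost-entry": nodes}
    return len(json.dumps(snapshot, separators=(",", ":")).encode("utf-8"))


def first_count_over_limit(with_meta, limit=EVENT_PAYLOAD_LIMIT):
    """Smallest endpoint count whose serialized snapshot exceeds `limit`."""
    lo, hi = 0, 1
    # Grow the upper bound until the snapshot no longer fits.
    while snapshot_size(hi, with_meta) <= limit:
        lo, hi = hi, hi * 2
    # Binary search; invariant: size(lo) <= limit < size(hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if snapshot_size(mid, with_meta) <= limit:
            lo = mid
        else:
            hi = mid
    return hi
```

Calling `first_count_over_limit(False)` and `first_count_over_limit(True)` gives the thresholds without and with metadata under these assumptions. The exact numbers depend on the real payload format, but the metadata case always crosses the limit at a lower endpoint count.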
### Error Logs
2025/12/26 12:32:39 [error] 49#49: *734 [lua] init.lua:583: post_event
failure with discovery_consul_update_all_services, update all services error:
failed to publish event: payload exceeds the limitation (65536), context:
ngx.timer
### Steps to Reproduce
1. Prepare a Consul registry (local or remote) and configure APISIX to use
Consul as the service discovery backend.
2. Register a large number of instances for a single service in Consul.
This can be done by repeatedly calling Consul’s
`/v1/agent/service/register` API to register hundreds or thousands of instances
under the same service name.
Each instance uses a unique `ID`, but shares the same `Name`, `Address`,
and `Port`, and includes instance-level metadata via the `Meta` field (used for
endpoint matching in APISIX).
3. Ensure that the total number of registered instances is large enough
(e.g. 1000+) so that the serialized discovery snapshot (service + endpoints +
metadata) exceeds the size that can be carried by a single worker event.
4. Wait for APISIX to trigger a full discovery synchronization (e.g. via the
periodic discovery timer), or restart APISIX to force an initial full sync.
5. Observe the APISIX error log.
When the full discovery snapshot is broadcast via worker events, APISIX
fails to publish the event and logs an error similar to:
```
[lua] init.lua:583: post_event failure with
discovery_consul_update_all_services, update all services error: failed to
publish event: payload exceeds the limitation (65536), context: ngx.timer
```
6. At this point, workers stop receiving updated discovery state for the
affected service, resulting in stale or inconsistent endpoint information.
```python
#!/usr/bin/env python3
"""
Bulk-register many instances of a single service in Consul to reproduce the
APISIX discovery payload overflow, using Meta fields instead of tags.
"""
import requests

CONSUL_HOST = "http://127.0.0.1:8500"  # Consul address
SERVICE_NAME = "billing.ad.oversea.cost-entry"
INSTANCE_COUNT = 1000  # adjust to trigger the payload limit
SERVICE_PORT = 8080

for i in range(1, INSTANCE_COUNT + 1):
    # Every instance gets a unique ID but shares Name, Address, and Port.
    service_id = f"{SERVICE_NAME}-{i}"
    payload = {
        "ID": service_id,
        "Name": SERVICE_NAME,
        "Address": "127.0.0.1",
        "Port": SERVICE_PORT,
        "Meta": {
            "version": "v1",
            "region": "us-east-1",
        },
    }
    resp = requests.put(f"{CONSUL_HOST}/v1/agent/service/register", json=payload)
    if resp.status_code != 200:
        print(f"Failed to register {service_id}: {resp.text}")
    elif i % 100 == 0:
        print(f"Registered {i} instances...")

print("Done. Now trigger APISIX discovery sync to reproduce the issue.")
```
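Before waiting for the APISIX sync, it may help to confirm that Consul actually holds the expected number of instances. The sketch below queries Consul's `/v1/health/service/<name>` endpoint and reports the instance count and the size of the response. It assumes the same Consul address and service name as the script above, and the helper names are hypothetical. The byte count is the size of Consul's own response, which is only a rough indicator and not the size of the payload APISIX builds.

```python
import json
import urllib.request

CONSUL_HOST = "http://127.0.0.1:8500"  # assumed: same address as the script above
SERVICE_NAME = "billing.ad.oversea.cost-entry"


def fetch_instances(consul_host=CONSUL_HOST, service=SERVICE_NAME):
    """Return the instance list from Consul's health endpoint for one service."""
    url = f"{consul_host}/v1/health/service/{service}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize(instances):
    """Report how many instances were returned and their compact JSON size."""
    encoded = json.dumps(instances, separators=(",", ":")).encode("utf-8")
    return {"count": len(instances), "bytes": len(encoded)}
```

Running `print(summarize(fetch_instances()))` against the local Consul agent should report a `count` equal to `INSTANCE_COUNT` if every registration succeeded.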
### Environment
- APISIX version (run `apisix version`):
- Operating system (run `uname -a`):
- OpenResty / Nginx version (run `openresty -V` or `nginx -V`):
- etcd version, if relevant (run `curl
http://127.0.0.1:9090/v1/server_info`):
- APISIX Dashboard version, if relevant:
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run `luarocks --version`):