jizhuozhi opened a new issue, #12846:
URL: https://github.com/apache/apisix/issues/12846

   ### Current Behavior
   
   While integrating APISIX as an ingress / access gateway for large downstream 
services, I encountered a discovery synchronization issue that appears to be 
structural rather than configuration-related.
   
   During routine testing, APISIX failed to update its service discovery 
state and logged the following error:
   
   ```
   2025/12/26 12:32:39 [error] 49#49: *734 [lua] init.lua:583: post_event 
failure with discovery_consul_update_all_services, update all services error: 
failed to publish event: payload exceeds the limitation (65536), context: 
ngx.timer
   ```
   
   This happened during a full discovery update from the registry (Consul in 
this case), triggered by a timer.
   
   ### Expected Behavior
   
   APISIX currently synchronizes discovery state from the control plane to 
workers by broadcasting **full snapshots** (services + endpoints) via worker 
events.
   
   Worker events have a **default payload size limit of 64 KB** (the 65536 
bytes reported in the error), which makes this model inherently non-scalable:
   
   * Even **without endpoint metadata**, a single service with thousands of 
IP:port pairs can exceed this limit.
   * Adding metadata (e.g., for subset routing, lanes, versions) increases 
payload size and accelerates hitting the ceiling.
   * Large downstream services are increasingly common when APISIX serves as 
a centralized ingress / access platform.
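
As a rough illustration of the two size claims above, the sketch below serializes a hypothetical snapshot and compares it with the 65536-byte limit from the error log. The node shape (`host`, `port`, `weight`, `metadata`) and the use of compact JSON are assumptions made for the estimate, not the exact payload APISIX builds, so the numbers only indicate the order of magnitude.

```python
import json

EVENT_PAYLOAD_LIMIT = 65536  # limit reported in the APISIX error log


def snapshot_size(instance_count, with_meta=False):
    """Estimate the serialized size (bytes) of a one-service snapshot.

    The node layout is an assumed, simplified shape for illustration only.
    """
    nodes = []
    for i in range(instance_count):
        node = {
            "host": f"10.0.{i // 256}.{i % 256}",  # synthetic addresses
            "port": 8080,
            "weight": 1,
        }
        if with_meta:
            node["metadata"] = {"version": "v1", "region": "us-east-1"}
        nodes.append(node)
    snapshot = {"billing.ad.oversea.cost-entry": nodes}
    return len(json.dumps(snapshot, separators=(",", ":")).encode("utf-8"))


for count, meta in [(1000, False), (2000, False), (1000, True)]:
    size = snapshot_size(count, with_meta=meta)
    verdict = "exceeds" if size > EVENT_PAYLOAD_LIMIT else "fits within"
    print(f"{count} instances, metadata={meta}: {size} bytes, {verdict} the limit")
```

Under these assumptions a bare host/port list crosses the limit somewhere between one and two thousand instances, and adding two small metadata fields is enough to cross it at around one thousand.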
   
   This limitation leads to:
   
   * inconsistent worker state
   * stale endpoints
   * broken routing or subset selection
   
   Importantly, this applies to any registry backend using the same broadcast 
model, not only Consul.
   
   ---
   
   ### Why this is a structural issue
   
   The current design assumes that discovery state can be delivered as a single 
event payload. In large-scale environments:
   
   * Endpoint count grows with traffic and business scale
   * Event payload size grows linearly with endpoint count
   * The 64 KB limit is therefore reached sooner or later
   
   As a result, this failure mode will occur in sufficiently large deployments.
   
   ### Error Logs
   
   2025/12/26 12:32:39 [error] 49#49: *734 [lua] init.lua:583: post_event 
failure with discovery_consul_update_all_services, update all services error: 
failed to publish event: payload exceeds the limitation (65536), context: 
ngx.timer
   
   ### Steps to Reproduce
   
   1. Prepare a Consul registry (local or remote) and configure APISIX to use 
Consul as the service discovery backend.
   
   2. Register a large number of instances for a single service in Consul.
   
      This can be done by repeatedly calling Consul’s 
`/v1/agent/service/register` API to register hundreds or thousands of instances 
under the same service name.
      Each instance uses a unique `ID`, but shares the same `Name`, `Address`, 
and `Port`, and includes instance-level metadata via the `Meta` field (used for 
endpoint matching in APISIX).
   
   3. Ensure that the total number of registered instances is large enough 
(e.g. 1000+) so that the serialized discovery snapshot (service + endpoints + 
metadata) exceeds the size that can be carried by a single worker event.
   
   4. Wait for APISIX to trigger a full discovery synchronization (e.g. via the 
periodic discovery timer), or restart APISIX to force an initial full sync.
   
   5. Observe the APISIX error log.
      When the full discovery snapshot is broadcast via worker events, APISIX 
fails to publish the event and logs an error similar to:
   
   ```
   [lua] init.lua:583: post_event failure with 
discovery_consul_update_all_services, update all services error: failed to 
publish event: payload exceeds the limitation (65536), context: ngx.timer
   ```
   
   6. At this point, workers stop receiving updated discovery state for the 
affected service, resulting in stale or inconsistent endpoint information.
   
   ```python
   #!/usr/bin/env python3
   """
   Bulk-register many instances of a single service in Consul to reproduce
   the APISIX discovery payload overflow, using Meta fields instead of tags.
   """
   
   import requests
   
   CONSUL_HOST = "http://127.0.0.1:8500"  # Consul agent address
   SERVICE_NAME = "billing.ad.oversea.cost-entry"
   INSTANCE_COUNT = 1000  # adjust to trigger the payload limit
   SERVICE_PORT = 8080
   
   for i in range(1, INSTANCE_COUNT + 1):
       # Each instance has a unique ID but shares Name, Address, and Port.
       service_id = f"{SERVICE_NAME}-{i}"
       payload = {
           "ID": service_id,
           "Name": SERVICE_NAME,
           "Address": "127.0.0.1",
           "Port": SERVICE_PORT,
           # Instance-level metadata, used for endpoint matching in APISIX.
           "Meta": {
               "version": "v1",
               "region": "us-east-1"
           }
       }
       resp = requests.put(
           f"{CONSUL_HOST}/v1/agent/service/register", json=payload, timeout=5
       )
       if resp.status_code != 200:
           print(f"Failed to register {service_id}: {resp.text}")
       elif i % 100 == 0:
           print(f"Registered {i} instances...")
   
   print("Done. Now trigger APISIX discovery sync to reproduce the issue.")
   ```
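
To check step 3 before waiting for a sync, a small helper can count the registered instances and estimate how large their serialized form would be. This is a minimal sketch: it reads Consul's `/v1/health/service/<name>` endpoint, and the node shape it serializes (`host`, `port`, `metadata`) is an assumption rather than the exact structure APISIX publishes, so treat the byte count as an approximation.

```python
import json
import urllib.request

CONSUL_HOST = "http://127.0.0.1:8500"  # same agent as the registration script
SERVICE_NAME = "billing.ad.oversea.cost-entry"
EVENT_PAYLOAD_LIMIT = 65536  # limit reported in the APISIX error log


def summarize(entries):
    """Return (instance_count, approx_bytes) for Consul health-API entries."""
    nodes = [
        {
            "host": entry["Service"]["Address"],
            "port": entry["Service"]["Port"],
            "metadata": entry["Service"].get("Meta") or {},
        }
        for entry in entries
    ]
    size = len(json.dumps(nodes, separators=(",", ":")).encode("utf-8"))
    return len(nodes), size


def fetch_entries(host=CONSUL_HOST, name=SERVICE_NAME):
    """Fetch every instance of `name` from Consul's health endpoint."""
    url = f"{host}/v1/health/service/{name}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

Calling `summarize(fetch_entries())` against a running agent returns the instance count and the approximate size; if the size is still below `EVENT_PAYLOAD_LIMIT`, raise `INSTANCE_COUNT` and register again.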
   
   ### Environment
   
   - APISIX version (run `apisix version`):
   - Operating system (run `uname -a`):
   - OpenResty / Nginx version (run `openresty -V` or `nginx -V`):
   - etcd version, if relevant (run `curl 
http://127.0.0.1:9090/v1/server_info`):
   - APISIX Dashboard version, if relevant:
   - Plugin runner version, for issues related to plugin runners:
   - LuaRocks version, for installation issues (run `luarocks --version`):
   

