Re: [I] bug: Discovery synchronization fails for large services due to shared memory payload constraints [apisix]

via GitHub Sat, 27 Dec 2025 05:09:59 -0800


jizhuozhi commented on issue #12846:
URL: https://github.com/apache/apisix/issues/12846#issuecomment-3693967865


   ### Approaches
   
   #### 1. Emit discovery events per service
   
   Split `update_all_services` into per-service events.
   
   * ✅ Mitigates the issue for multi-service setups
   * ❌ Still fails for a single very large service
   * ❌ Does not fundamentally remove the payload size ceiling
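
To illustrate why per-service events only help when no single service is too large, here is a minimal Python sketch. It is not APISIX code: the JSON payload shape and the names `per_service_events`, `oversized_services`, and `make_endpoints` are assumptions made for illustration. Only the 64KB limit comes from this issue.

```python
import json

# The 64KB event limit mentioned in this issue.
EVENT_LIMIT_BYTES = 64 * 1024


def per_service_events(snapshot):
    """Yield (service_name, payload_bytes), one event per service.

    The JSON envelope used here is a hypothetical payload shape.
    """
    for name, endpoints in snapshot.items():
        body = {"service": name, "endpoints": endpoints}
        yield name, json.dumps(body).encode("utf-8")


def oversized_services(snapshot, limit=EVENT_LIMIT_BYTES):
    """Return the services whose individual event would still exceed the limit."""
    return [
        name
        for name, payload in per_service_events(snapshot)
        if len(payload) > limit
    ]


def make_endpoints(count):
    """Build synthetic endpoints; the host/port fields are illustrative only."""
    return [
        {"host": f"10.0.{i // 256}.{i % 256}", "port": 8080}
        for i in range(count)
    ]


snapshot = {
    "small-a": make_endpoints(20),
    "small-b": make_endpoints(20),
    "huge": make_endpoints(5000),
}
print(oversized_services(snapshot))
```

In this synthetic snapshot the two small services fit comfortably, while the 5000-endpoint service still produces one event above the limit, which is the second drawback listed above.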
   
   ---
   
   #### 2. Adjust the shared memory size for events
   
   The worker events mechanism already stores event payloads in shared memory.
   
   * ✅ Sizing the shm so that event payloads stay within its limits can reduce 
immediate failures
   * ❌ Discovery updates are still fundamentally constrained by shm capacity
   * ❌ Requires continuous tuning of shm size to accommodate large payloads
   * ❌ Does **not** eliminate the structural scalability limitation of full 
snapshot broadcasts
   
   In practice, this turns the problem into **ongoing shm tuning**, rather than 
a true fix for large-scale deployments.
   
   ---
   
   #### 3. Let workers pull directly from the registry
   
   The master only watches for changes; each worker fetches endpoints from the registry itself.
   
   * ✅ Simple to implement
   * ❌ Multiplies registry load by worker count
   * ❌ Breaks the current “single watcher” design
   * ❌ Not registry-friendly at scale
   
   ---
   
   #### 4. Chunked / streaming discovery updates (preferred direction)
   
   Split large discovery payloads into multiple smaller chunks and deliver them 
incrementally via events, allowing workers to assemble the full state.
   
   * ✅ Avoids exceeding event payload limits
   * ✅ Preserves existing master/worker responsibilities
   * ✅ Does not increase registry pressure
   * ✅ Can evolve toward incremental or streaming discovery in the future
   * ❌ Requires chunk ordering, versioning, and completion semantics
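
The ordering, versioning, and completion semantics mentioned above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not a proposed APISIX implementation: the chunk fields (`version`, `index`, `total`, `data`) and the names `split_into_chunks` and `ChunkAssembler` are hypothetical, and serializing each chunk into an actual event is left out.

```python
def split_into_chunks(payload: bytes, version: int, max_chunk_bytes: int):
    """Split one snapshot into ordered chunks that share a version.

    max_chunk_bytes bounds only the data slice; a real event would also
    need room for the envelope fields.
    """
    if max_chunk_bytes <= 0:
        raise ValueError("max_chunk_bytes must be positive")
    # Ceiling division; an empty payload still yields one (empty) chunk.
    total = max(1, -(-len(payload) // max_chunk_bytes))
    return [
        {
            "version": version,
            "index": i,
            "total": total,
            "data": payload[i * max_chunk_bytes:(i + 1) * max_chunk_bytes],
        }
        for i in range(total)
    ]


class ChunkAssembler:
    """Worker-side reassembly: tolerates reordering, drops stale versions."""

    def __init__(self):
        self.version = None
        self.total = 0
        self.parts = {}

    def feed(self, chunk):
        """Return the full payload once every chunk has arrived, else None."""
        if self.version is not None and chunk["version"] < self.version:
            return None  # chunk from an older snapshot: ignore it
        if chunk["version"] != self.version:
            # A newer snapshot started: discard any partial older state.
            self.version = chunk["version"]
            self.total = chunk["total"]
            self.parts = {}
        self.parts[chunk["index"]] = chunk["data"]
        if len(self.parts) < self.total:
            return None  # still incomplete
        payload = b"".join(self.parts[i] for i in range(self.total))
        self.parts = {}
        return payload
```

A worker would call `feed` for each incoming event and apply the snapshot only when a non-`None` payload is returned, so partially delivered state is never exposed. Timeouts for snapshots that never complete are not handled in this sketch.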
   
   ---
   
   ### Additional Context: Kubernetes HPA makes this issue inevitable
   
   This issue surfaced after enabling Kubernetes HPA for downstream services. 
Initially, the number of endpoints was small and the problem was not 
observable. As HPA scaled the number of pods, discovery payload size grew until 
it exceeded the 64KB event limit.
   
   * The problem does **not** appear at deployment time but emerges naturally 
as traffic grows
   * In HPA-enabled environments, this failure mode is **inevitable**, not 
accidental
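
A rough way to see where the threshold sits is to measure how a serialized payload grows with the endpoint count. The Python sketch below assumes a JSON payload and a hypothetical endpoint shape (`host`, `port`, `weight`); real payloads will differ, so the result is only indicative. Only the 64KB limit is taken from this issue.

```python
import json

# The 64KB event limit mentioned in this issue.
EVENT_LIMIT_BYTES = 64 * 1024


def payload_size(endpoint_count):
    """Size in bytes of a synthetic single-service snapshot (assumed shape)."""
    endpoints = [
        {
            "host": f"10.{(i >> 16) & 255}.{(i >> 8) & 255}.{i & 255}",
            "port": 8080,
            "weight": 100,
        }
        for i in range(endpoint_count)
    ]
    body = {"service": "demo", "endpoints": endpoints}
    return len(json.dumps(body).encode("utf-8"))


def first_count_over_limit(limit=EVENT_LIMIT_BYTES):
    """Smallest endpoint count whose payload exceeds the limit.

    Assumes the empty payload fits, then doubles to find an upper bound
    and binary-searches between the last fitting and first failing counts.
    """
    low, high = 0, 1
    while payload_size(high) <= limit:
        low, high = high, high * 2
    while high - low > 1:
        mid = (low + high) // 2
        if payload_size(mid) <= limit:
            low = mid
        else:
            high = mid
    return high


threshold = first_count_over_limit()
print(f"payload exceeds {EVENT_LIMIT_BYTES} bytes at {threshold} endpoints")
```

Because payload size grows roughly linearly with the number of pods, any autoscaled service whose replica ceiling is above this threshold will eventually cross the limit.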
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
