jizhuozhi commented on issue #12846: URL: https://github.com/apache/apisix/issues/12846#issuecomment-3693967865
### Approaches

#### 1. Emit discovery events per service

Split `update_all_services` into per-service events.

* ✅ Mitigates the issue for multi-service setups
* ❌ Still fails for a single very large service
* ❌ Does not fundamentally remove the payload size ceiling

---

#### 2. Adjust the shared memory size for events

The worker events mechanism already stores event payloads in shared memory.

* ✅ Keeping event payloads within the current shm limits can reduce immediate failures
* ❌ Discovery updates are still fundamentally constrained by shm capacity
* ❌ Requires continuous tuning of shm size to accommodate large payloads
* ❌ Does **not** eliminate the structural scalability limitation of full snapshot broadcasts

In practice, this turns the problem into **ongoing shm tuning**, rather than a true fix for large-scale deployments.

---

#### 3. Let workers pull directly from the registry

Master only watches; workers fetch endpoints themselves.

* ✅ Simple to implement
* ❌ Multiplies registry load by worker count
* ❌ Breaks the current “single watcher” design
* ❌ Not registry-friendly at scale

---

#### 4. Chunked / streaming discovery updates (preferred direction)

Split large discovery payloads into multiple smaller chunks and deliver them incrementally via events, allowing workers to assemble the full state.

* ✅ Avoids exceeding event payload limits
* ✅ Preserves existing master/worker responsibilities
* ✅ Does not increase registry pressure
* ✅ Can evolve toward incremental or streaming discovery in the future
* ❌ Requires chunk ordering, versioning, and completion semantics

---

### Additional Context: Kubernetes HPA makes this issue inevitable

This issue surfaced after enabling Kubernetes HPA for downstream services. Initially, the number of endpoints was small and the problem was not observable. As HPA scaled the number of pods, discovery payload size grew until it exceeded the 64KB event limit.

* The problem does **not** appear at deployment time but emerges naturally as traffic grows
* In HPA-enabled environments, this failure mode is **inevitable**, not accidental
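
To make the "chunk ordering, versioning, and completion semantics" of approach 4 more concrete, here is a minimal sketch of what splitting and reassembly could look like. It is written in Python purely for illustration (APISIX itself is Lua), and every name in it (`split_snapshot`, `ChunkAssembler`, `MAX_CHUNK_BYTES`, the chunk envelope fields) is hypothetical, not an existing APISIX or worker-events API. The 32KB chunk budget is an assumption chosen only to stay below the 64KB event limit mentioned above.

```python
import json

# Assumed per-chunk budget (hypothetical); kept below the 64KB event limit
# so the envelope fields still fit alongside the data.
MAX_CHUNK_BYTES = 32 * 1024


def split_snapshot(version, snapshot, max_chunk_bytes=MAX_CHUNK_BYTES):
    """Serialize a full discovery snapshot and split it into ordered chunks.

    Each chunk carries the snapshot version, its index, and the total count,
    so a receiver can detect stale, missing, or out-of-order chunks.
    """
    raw = json.dumps(snapshot, sort_keys=True).encode("utf-8")
    pieces = [raw[i:i + max_chunk_bytes]
              for i in range(0, len(raw), max_chunk_bytes)]
    total = len(pieces)
    return [
        {"version": version, "index": i, "total": total, "data": piece}
        for i, piece in enumerate(pieces)
    ]


class ChunkAssembler:
    """Worker-side buffer that rebuilds a snapshot from its chunks."""

    def __init__(self):
        self.version = None  # newest snapshot version seen so far
        self.parts = {}      # chunk index -> bytes for that version

    def feed(self, chunk):
        """Accept one chunk; return the snapshot when complete, else None."""
        version = chunk["version"]
        if self.version is not None and version < self.version:
            return None  # stale chunk from an older snapshot: ignore it
        if version != self.version:
            # A newer snapshot started: drop any partial older state.
            self.version = version
            self.parts = {}
        self.parts[chunk["index"]] = chunk["data"]
        if len(self.parts) < chunk["total"]:
            return None  # still waiting for more chunks
        raw = b"".join(self.parts[i] for i in range(chunk["total"]))
        self.parts = {}
        return json.loads(raw.decode("utf-8"))
```

In this sketch the master would call `split_snapshot` once per registry update and post each chunk as its own event, while each worker feeds received chunks into its own `ChunkAssembler` and swaps in the new endpoint table only when `feed` returns a complete snapshot. A worker therefore never exposes a partially assembled state, and chunks from a superseded version are discarded rather than mixed in.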
