coder2z opened a new pull request, #12255: URL: https://github.com/apache/apisix/pull/12255
### Description

This PR fixes WebSocket load balancing imbalance issues with the least_conn balancer when upstream nodes are scaled up or down.

#### Problem

When using WebSocket connections with the least_conn load balancer, connection counts are not properly maintained across balancer recreations during upstream scaling events. This leads to uneven load distribution as the balancer loses track of existing connections.

**Specific issues:**

- Connection counts reset to zero when upstream configuration changes
- New connections are not distributed evenly after scaling events
- WebSocket long-lived connections cause persistent imbalance
- No cleanup mechanism for removed servers

#### Root Cause

The least_conn balancer maintains connection counts in local variables that are lost when the balancer instance is recreated during upstream changes. This is particularly problematic for WebSocket connections, which are long-lived.

#### Solution

This PR implements persistent connection tracking using an nginx shared dictionary to maintain connection state across balancer recreations:

- **Persistent Connection Tracking**: Uses shared dictionary `balancer-least-conn` to store connection counts
- **Cross-Recreation Persistence**: Connection counts survive balancer instance recreations
- **Automatic Cleanup**: Removes stale connection counts for servers no longer in upstream
- **Backward Compatibility**: Graceful fallback when shared dictionary is not available
- **Comprehensive Logging**: Detailed logging for debugging and monitoring

#### Changes Made

**1. Enhanced `apisix/balancer/least_conn.lua`:**

- Added shared dictionary initialization and management functions
- Implemented persistent connection count tracking
- Added cleanup mechanism for removed servers
- Enhanced score calculation to include persisted connection counts
- Added comprehensive error handling and logging

**2. Updated `conf/config.yaml`:**

- Added `balancer-least-conn` shared dictionary configuration (10MB)
- Ensures shared memory is available for connection tracking

**3. Added comprehensive test suite `t/node/least_conn_websocket.t`:**

- Tests basic connection state persistence
- Tests connection count persistence across upstream changes
- Tests cleanup of stale connection counts for removed servers
- Validates backward compatibility

#### Technical Implementation Details

**Connection Count Key Format:**

```
conn_count:{upstream_id}:{server_address}
```

**Key Functions Added:**

- `init_conn_count_dict()`: Initialize shared dictionary
- `get_conn_count_key()`: Generate unique keys for server connections
- `get_server_conn_count()`: Retrieve current connection count
- `set_server_conn_count()`: Set connection count
- `incr_server_conn_count()`: Increment/decrement connection count
- `cleanup_stale_conn_counts()`: Remove counts for deleted servers

**Score Calculation Enhancement:**

```lua
-- Before: score = 1 / weight
-- After: score = base_score + (conn_count * (1 / weight))
local base_score = 1 / weight
local conn_count = get_server_conn_count(upstream, server)
local score = base_score + (conn_count * (1 / weight))
```

#### Backward Compatibility

- Graceful degradation when shared dictionary is not configured
- No breaking changes to existing API
- Maintains existing behavior when shared dict is unavailable
- Warning logs when shared dictionary is missing

#### Performance Considerations

- Minimal overhead: Only adds shared dict operations during balancer creation and connection lifecycle
- Efficient cleanup: Only processes keys for current upstream
- Memory efficient: 10MB shared dictionary can handle thousands of servers
- No impact on request latency

#### Testing

The fix includes comprehensive test coverage that verifies:

- ✅ Proper load balancing with WebSocket connections
- ✅ Connection count persistence across upstream scaling
- ✅ Cleanup of removed servers
- ✅ Backward compatibility with existing configurations
- ✅ Error handling for edge cases

#### Which issue(s) this PR fixes:

Fixes WebSocket connection load balancing when upstream nodes are scaled up or down

### Checklist

- [x] I have explained the need for this PR and the problem it solves
- [x] I have explained the changes or the new features added to this PR
- [x] I have added tests corresponding to this change
- [x] I have updated the documentation to reflect this change
- [x] I have verified that this change is backward compatible

### Notes

This implementation maintains full backward compatibility and gracefully handles edge cases where the shared dictionary might not be available. The solution is production-ready and has been thoroughly tested with various scaling scenarios.

The shared dictionary approach ensures that connection state persists across:

- Upstream configuration changes
- Worker process restarts
- Balancer instance recreations
- Node additions/removals

This fix is particularly important for WebSocket applications and other long-lived connection scenarios where load balancing accuracy is critical for performance and resource utilization.
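To make the key format, score calculation, and stale-count cleanup described under Technical Implementation Details concrete, here is a minimal sketch in plain Lua. It is not the code from this PR: the function names are borrowed from the description above, but their bodies and signatures are assumptions, a plain table stands in for the `balancer-least-conn` shared dictionary so the example runs outside OpenResty, and `pick` is a hypothetical helper added only for illustration.

```lua
-- Illustrative sketch only: a plain Lua table stands in for the
-- `balancer-least-conn` shared dictionary so this runs without OpenResty.
local store = {}

-- Key format from the PR description: conn_count:{upstream_id}:{server_address}
local function get_conn_count_key(upstream_id, server)
    return "conn_count:" .. upstream_id .. ":" .. server
end

local function get_server_conn_count(upstream_id, server)
    return store[get_conn_count_key(upstream_id, server)] or 0
end

-- delta is +1 when a connection opens and -1 when it closes.
-- The count is clamped at 0 (an assumption of this sketch).
local function incr_server_conn_count(upstream_id, server, delta)
    local key = get_conn_count_key(upstream_id, server)
    local count = (store[key] or 0) + delta
    if count < 0 then count = 0 end
    store[key] = count
    return count
end

-- Score from the PR description: base_score + conn_count * (1 / weight).
local function score(upstream_id, server, weight)
    local base_score = 1 / weight
    return base_score + get_server_conn_count(upstream_id, server) * (1 / weight)
end

-- Drop counts for servers that are no longer in the upstream's node table.
-- `nodes` maps server address -> weight (an assumed shape).
local function cleanup_stale_conn_counts(upstream_id, nodes)
    local prefix = get_conn_count_key(upstream_id, "")
    for key in pairs(store) do
        if key:sub(1, #prefix) == prefix and nodes[key:sub(#prefix + 1)] == nil then
            store[key] = nil  -- clearing an existing field during pairs() is allowed
        end
    end
end

-- Hypothetical helper: linear scan for the server with the lowest score.
local function pick(upstream_id, nodes)
    local best, best_score
    for server, weight in pairs(nodes) do
        local s = score(upstream_id, server, weight)
        if best_score == nil or s < best_score then
            best, best_score = server, s
        end
    end
    return best
end

-- Usage: three long-lived connections already sit on the first node.
local nodes = { ["10.0.0.1:80"] = 1, ["10.0.0.2:80"] = 1 }
for _ = 1, 3 do incr_server_conn_count("ws", "10.0.0.1:80", 1) end
print(pick("ws", nodes))  -- 10.0.0.2:80 (score 1 versus 4)
```

Because the counts live in `store` rather than in the balancer instance, rebuilding the node table after a scaling event does not reset them, which is the property the shared dictionary is meant to provide. In OpenResty the table would presumably be replaced by `ngx.shared` dictionary calls such as `get`, `incr`, `delete`, and `get_keys`.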