hebbaa opened a new issue, #12767:
URL: https://github.com/apache/apisix/issues/12767

   ### Current Behavior
   
   ### Issue Description
   We are observing a significant discrepancy within a single Apache APISIX pod 
where the aggregated Prometheus metric for active connections 
(apisix_nginx_http_current_connections{state="active"}) reports a much higher 
number than the actual number of open TCP sockets reported by the operating 
system's lsof command for all Nginx processes within that pod.
   This over-reporting seems related to Nginx worker process recycling, 
suggesting a flaw in how the nginx-lua-prometheus module or APISIX aggregation 
handles statistics from older, draining worker processes.
   
   The investigation is based on the following specific data observed within a 
single APISIX Kubernetes pod:
   
   **1. System Metrics via curl (Prometheus Endpoint)**
   The endpoint reported these values for current connections:
   ```
   apisix_nginx_http_current_connections{state="active"} 58193
   apisix_nginx_http_current_connections{state="accepted"} 234773641
   apisix_nginx_http_current_connections{state="handled"} 234773641
   apisix_nginx_http_current_connections{state="waiting"} 58076
   apisix_nginx_http_current_connections{state="writing"} 89
   ```
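   As a sanity check, the `active` value can be pulled out of a scrape line and compared with the lsof count. This is only a sketch using the figures captured in this report (58193 and 39385), not live values:
   
   ```shell
   # Sketch: extract the "active" value from a metrics scrape line and compare
   # it with the OS-level socket count. Both numbers are the ones observed in
   # this report, not live values.
   metrics='apisix_nginx_http_current_connections{state="active"} 58193'
   active=$(printf '%s\n' "$metrics" | awk '/state="active"/ {print $2}')
   os_sockets=39385   # from: lsof -iTCP -a -p $(pgrep nginx | tr '\n' ',') | wc -l
   echo "metric=$active os=$os_sockets diff=$((active - os_sockets))"
   ```
   
   The difference is 18,808 connections that exist in the metric but not in the kernel's socket tables.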
   **2. Operating System File Descriptor Count (lsof)**
   Running lsof across all Nginx PIDs within that same pod returned a significantly lower number:
   ```
   lsof -iTCP -a -p $(pgrep nginx | tr '\n' ',') | wc -l
   39385
   ```
   **3. Process List (ps -eaf and stat)**
   The process list revealed four total worker processes, with two running much 
longer than the others, indicating a configuration reload event occurred:
   
   ```
   # Process list output subset
   PID   USER     TIME  COMMAND
      51 apisix    1d06 {openresty} nginx: worker process  # Long running (~1 
day CPU time)
      52 apisix    1d06 {openresty} nginx: worker process  # Long running (~1 
day CPU time)
     114 apisix   33:48 {openresty} nginx: worker process  # Short running (~34 
minutes CPU time)
     116 apisix   23:32 {openresty} nginx: worker process  # Short running (~24 
minutes CPU time)
   
   # Timestamp of new worker process 114 start time
   stat /proc/114
   Change: 2025-11-23 09:38:44.660489426 +0000 
   ```
   
   **4. System Configuration**
   Configured Limit: worker_connections 10620;
   Total Expected Max Connections (Per Pod): 4 workers * 10620 
connections/worker = 42,480 total connections.
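   That ceiling is simply worker_processes multiplied by worker_connections; a quick check of the arithmetic:
   
   ```shell
   # Expected per-pod ceiling = worker_processes * worker_connections
   workers=4
   worker_connections=10620
   echo $((workers * worker_connections))   # 42480
   ```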
   
   ### Analysis of Observations and Inconsistencies
   We have identified two primary inconsistencies based on this data:
   **Inconsistency 1: Metric Count vs. OS Count**
   The Nginx metric reports ~58,193 active connections, while the operating system (lsof) reports ~39,385 open TCP sockets (file descriptors).
   The lsof count is the ground truth for actual resource consumption. The Nginx metric is overstating the active connection count by nearly 19,000 connections (roughly the number of active connections two worker processes might hold).
   **Inconsistency 2: Active Connections vs. Configured Limit**
   The OS-level count of ~39k is within the configured maximum of 42,480. However, the Nginx metric reports ~58k, suggesting the system is vastly exceeding its limit when in reality it is not.
   **Conclusion on the Discrepancy**
   The most logical explanation for these observations is a bug in how the 
APISIX Prometheus metrics module aggregates statistics across different 
generations of Nginx worker processes within the same pod.
   When Nginx reloads gracefully (starting PIDs 114 and 116 while PIDs 51 and 
52 drain), the metrics module seems to be incorrectly summing the statistics of 
both the old, draining workers and the new, active workers simultaneously, 
resulting in a misleadingly high "active" count that doesn't reflect the actual 
consumed file descriptors shown by lsof.
   The true number of active connections for this pod is closer to 39,385.
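   A rough arithmetic check supports this hypothesis (all figures are taken from the data above): if the module is still counting the two drained workers, the overshoot split between them should be a plausible per-worker connection count, i.e. below worker_connections.
   
   ```shell
   # Hypothesis check, using only the numbers reported above.
   metric_active=58193
   os_sockets=39385
   worker_connections=10620
   overshoot=$((metric_active - os_sockets))
   per_old_worker=$((overshoot / 2))   # two old workers: PIDs 51 and 52
   echo "overshoot=$overshoot per_old_worker=$per_old_worker"
   # per_old_worker (9404) is below worker_connections (10620), consistent
   # with the overshoot being the stale counts of two old workers.
   ```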
   
   ### Expected Behavior
   
   The `apisix_nginx_http_current_connections{state="active"}` metric should track the actual number of open TCP sockets held by the worker processes; the true number of active connections for this pod is closer to 39,385.
   
   
   ### Error Logs
   
   _No response_
   
   ### Steps to Reproduce
   
   Observed in a production deployment; the discrepancy appeared after a graceful configuration reload left old workers draining alongside new ones (see the process list above).
   
   ### Environment
   
   - APISIX version (run `apisix version`): 3.14.1
   - Operating system (run `uname -a`): Linux data-plane-7c4b955754-dp552 5.10.230-223.885.amzn2.x86_64 #1 SMP Tue Dec 3 14:36:00 UTC 2024 x86_64 GNU/Linux
   
   - OpenResty / Nginx version (run `openresty -V` or `nginx -V`): nginx version: openresty/1.27.1.2 (x86_64-pc-linux-gnu)
   built by gcc 15.2.0 (Wolfi 15.2.0-r3) 
   built with OpenSSL 3.6.0 1 Oct 2025
   TLS SNI support enabled
   
   - etcd version, if relevant (run `curl 
http://127.0.0.1:9090/v1/server_info`):
   - APISIX Dashboard version, if relevant:
   - Plugin runner version, for issues related to plugin runners:
   - LuaRocks version, for installation issues (run `luarocks --version`):
   

