hebbaa opened a new issue, #12767:
URL: https://github.com/apache/apisix/issues/12767
### Current Behavior
### Issue Description
We are observing a significant discrepancy within a single Apache APISIX pod
where the aggregated Prometheus metric for active connections
(apisix_nginx_http_current_connections{state="active"}) reports a much higher
number than the actual number of open TCP sockets reported by the operating
system's lsof command for all Nginx processes within that pod.
This over-reporting seems related to Nginx worker process recycling,
suggesting a flaw in how the nginx-lua-prometheus module or APISIX aggregation
handles statistics from older, draining worker processes.
The investigation is based on the following specific data observed within a
single APISIX Kubernetes pod:
**1. System Metrics via curl (Prometheus Endpoint)**
The endpoint reported these values for current connections:
```
apisix_nginx_http_current_connections{state="active"} 58193
apisix_nginx_http_current_connections{state="accepted"} 234773641
apisix_nginx_http_current_connections{state="handled"} 234773641
apisix_nginx_http_current_connections{state="waiting"} 58076
apisix_nginx_http_current_connections{state="writing"} 89
```
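To repeat this comparison over time, it helps to extract a single gauge from the exposition text instead of reading it by eye. The following is a minimal sketch, assuming the metric lines have exactly the shape shown above. The function name `conn_state` is hypothetical, and the endpoint in the usage note assumes the default APISIX Prometheus export address, which may differ in your deployment.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus exposition text on stdin and print the
# value of apisix_nginx_http_current_connections{state="<state>"}.
conn_state() {
    awk -v want="apisix_nginx_http_current_connections{state=\"$1\"}" \
        '$1 == want { print $2 }'
}
```

Possible usage, assuming the default export address: `curl -s http://127.0.0.1:9091/apisix/prometheus/metrics | conn_state active`.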
**2. Operating System File Descriptor Count (lsof)**
Running lsof across all Nginx PIDs within that same pod returned a significantly
lower number:
```
lsof -iTCP -a -p $(pgrep nginx | tr '\n' ',') | wc -l
39385
```
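A per-process breakdown would show whether the old, draining workers still hold a meaningful share of the sockets. The sketch below is one way to get it, assuming a Linux `/proc` filesystem and permission to read each process's `fd` directory. The function name `count_sockets` is hypothetical, and it counts every socket type (TCP, UDP, Unix), so it is an upper bound on what `lsof -iTCP` reports for the same PID.

```shell
#!/bin/sh
# Hypothetical helper: count socket file descriptors held by one PID by
# listing /proc/<pid>/fd. Prints 0 if the directory cannot be read.
count_sockets() {
    ls -l "/proc/$1/fd" 2>/dev/null | grep -c 'socket:'
}

# Print "<pid> <socket count>" for every nginx process, so the long-running
# and the recently started workers can be compared side by side.
for pid in $(pgrep nginx 2>/dev/null); do
    printf '%s %s\n' "$pid" "$(count_sockets "$pid")"
done
```

Note that `lsof` prints a header line, so piping its output to `wc -l` gives a count one higher than the number of sockets listed.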
**3. Process List (ps -eaf and stat)**
The process list revealed four total worker processes, with two running much
longer than the others, indicating a configuration reload event occurred:
```
# Process list output subset
PID USER TIME COMMAND
51 apisix 1d06 {openresty} nginx: worker process # Long running (~1 day CPU time)
52 apisix 1d06 {openresty} nginx: worker process # Long running (~1 day CPU time)
114 apisix 33:48 {openresty} nginx: worker process # Short running (~34 minutes CPU time)
116 apisix 23:32 {openresty} nginx: worker process # Short running (~24 minutes CPU time)
# Timestamp of new worker process 114 start time
stat /proc/114
Change: 2025-11-23 09:38:44.660489426 +0000
```
**4. System Configuration**
- Configured limit: `worker_connections 10620;`
- Total expected max connections (per pod): 4 workers * 10620
connections/worker = 42,480 total connections.
### Analysis of Observations and Inconsistencies
We have identified two primary inconsistencies based on this data:
**Inconsistency 1: Metric Count vs. OS Count**
The Nginx metric reports 58,193 active connections.
The operating system (lsof) reports 39,385 open TCP sockets (file
descriptors).
We treat the lsof count as the ground truth for actual resource consumption.
The Nginx metric overstates the active connection count by nearly 19,000
connections (roughly the number of active connections two worker processes
might hold).
**Inconsistency 2: Active Connections vs. Configured Limit**
The OS-level count of ~39k is within the configured maximum of 42,480.
However, the Nginx metric reports ~58k, which suggests the pod is far
exceeding its limit when in reality it is not.
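The arithmetic behind both inconsistencies can be checked with a short script. This is only a sketch using the figures reported above; the variable names are ours, and the worker count of 4 is the number of worker processes seen in the process list, not a value read from the configuration.

```shell
#!/bin/sh
# Figures copied from the observations in this report.
metric_active=58193        # apisix_nginx_http_current_connections{state="active"}
lsof_tcp=39385             # line count from the lsof command
worker_connections=10620   # configured per-worker limit
workers=4                  # worker processes seen in the process list

limit=$((workers * worker_connections))   # expected per-pod maximum
excess=$((metric_active - lsof_tcp))      # metric minus OS-level count
over_limit=$((metric_active - limit))     # metric minus expected maximum
headroom=$((limit - lsof_tcp))            # OS-level count below the maximum

echo "expected per-pod maximum: $limit"      # 42480
echo "metric exceeds lsof by:   $excess"     # 18808
echo "metric exceeds limit by:  $over_limit" # 15713
echo "lsof is below limit by:   $headroom"   # 3095
```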
**Conclusion on the Discrepancy**
The most logical explanation for these observations is a bug in how the
APISIX Prometheus metrics module aggregates statistics across different
generations of Nginx worker processes within the same pod.
When Nginx reloads gracefully (starting PIDs 114 and 116 while PIDs 51 and
52 drain), the metrics module seems to be incorrectly summing the statistics of
both the old, draining workers and the new, active workers simultaneously,
resulting in a misleadingly high "active" count that doesn't reflect the actual
consumed file descriptors shown by lsof.
The true number of active connections for this pod is closer to 39,385.
### Expected Behavior
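The `apisix_nginx_http_current_connections{state="active"}` metric should
track the number of connections actually open in the pod, which here is closer
to 39,385 than to the reported 58,193.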
### Error Logs
_No response_
### Steps to Reproduce
Observed in production deployment
### Environment
- APISIX version (run `apisix version`): 3.14.1
- Operating system (run `uname -a`): Linux data-plane-7c4b955754-dp552 5.10.230-223.885.amzn2.x86_64 #1 SMP Tue Dec 3 14:36:00 UTC 2024 x86_64 GNU/Linux
- OpenResty / Nginx version (run `openresty -V` or `nginx -V`):
  nginx version: openresty/1.27.1.2 (x86_64-pc-linux-gnu)
  built by gcc 15.2.0 (Wolfi 15.2.0-r3)
  built with OpenSSL 3.6.0 1 Oct 2025
  TLS SNI support enabled
- etcd version, if relevant (run `curl http://127.0.0.1:9090/v1/server_info`):
- APISIX Dashboard version, if relevant:
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run `luarocks --version`):