Re: [vpp-dev] VCL/LDP epoll notification delays and stability issues with Nginx reverse proxy behind SSLO

Florin Coras via lists.fd.io Mon, 29 Jun 2026 14:09:53 -0700

Hi Guo, 

Inline.


> On Jun 29, 2026, at 3:08 AM, Guo Huiliang via lists.fd.io 
> <[email protected]> wrote:
> 
> Hi VPP community,
> I'm running a custom VPP build with SSLO plugin and Nginx as reverse proxy 
> behind VCL/LDP, and I'm encountering several interrelated stability issues. 
> I'd appreciate any insight from the community on whether these are known 
> limitations or if there are workarounds I'm missing.
> Architecture
> 
> Client ──TLS──→ [VPP SSLO plugin] ──→ loop0 ──→ Nginx-decrypt (VCL namespace v10)
>                                                        │
>                                                  HTTP (cleartext)
>                                                        │
>                                                        ↓
>                                                   loop1 ──→ Nginx-encrypt (VCL namespace v20)
>                                                        │
>                                                  TLS re-encrypt
>                                                        │
>                                                        ↓
>                                                   [VPP SSLO plugin] ──TLS──→ Backend :443

I’m assuming SSLO is a custom plugin but, from the looks of it, it has no
impact on TCP termination for Nginx.

> VPP: 23.06  custom build with SSLO plugin, DPDK 4-worker, 8G main heap

Would it be possible to test with a newer VPP? 23.06 is 3 years old at this
point.

> Nginx-decrypt: TLS termination, bind to loop0 (172.16.102.213)
> Nginx-encrypt: receives cleartext HTTP on loop1 (172.16.102.214:9003),
> proxy_pass https:// to backends
> Both Nginx instances: master_process off, worker_processes 1, LD_PRELOAD 
> libvcl_ldpreload.so
> VPP session: preallocated-sessions 500000, 4 worker threads
> Issue 1: VCL epoll event notification delay (5+ seconds)
> 
> Symptom: Nginx-decrypt returns 504 Gateway Timeout to clients despite backend 
> responding promptly.
> Evidence from SSLO-decrypted packet capture between decrypt and encrypt Nginx:
> At T=42.058s: Backend sends ~100KB of HTTP response data (30+ segments, 1518 
> bytes each) within 50ms
> TCP ACKs are returned immediately at the VPP/TCP layer (sub-millisecond)
> At T=47.110s: Backend sends FIN, ACK (5 seconds after the last data segment)
> During this 5-second gap, no VCL events were delivered to Nginx-encrypt's 
> epoll
> The backend clearly responded within ~8 seconds total, but the VCL layer 
> introduced a 5-second delay in delivering the readable-event to 
> Nginx-encrypt's epoll. This pushed the total round-trip past Nginx-decrypt's 
> proxy_read_timeout (8s), causing the 504.
> VCL config for the affected Nginx-encrypt instance:
> vcl {
>     huge_page
>     heapsize 512M
>     segment-size 268435456
>     add-segment-size 268435456
>     rx-fifo-size 524288
>     tx-fifo-size 524288
>     event-queue-size 100000
>     app-timeout 10.0
>     session-timeout 30.0
>     use-mq-eventfd
>     app-socket-api /run/vpp/app_ns_sockets/v20
>     namespace-id v20
> }
> The Nginx-decrypt instance uses event-queue-size 500000 and does not exhibit 
> the same delay.

It’s hard for VCL to delay events for that long, so more likely events were
either missed because the mq was overrun or ignored in VCL. Scale matters
here: if there are close to 0.5M sessions in VCL/Nginx and there’s a lot of
churn, increasing event-queue-size might help.

As a separate suggestion, maybe increase segment-size as well, given the scale
of the test. Use at least 1G.
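
As a rough sketch of the two suggestions above applied to the v20 config, with
only the changed lines shown and everything else left as quoted. The 500000
simply mirrors what the decrypt instance already uses and 1073741824 is 1G in
bytes; neither value is tuned for this workload, and whether add-segment-size
should follow segment-size is left open here.

```
vcl {
    segment-size 1073741824
    event-queue-size 500000
}
```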

> Questions:
> Is event-queue-size 100000 too small for a proxy handling large HTTP 
> responses (30+ TCP segments per response)?

HTTP response size should not directly matter. What matters is how many 
applications have ctrl/io events at one point in time and how much time nginx 
needs to drain them. 
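
To make that point concrete, here is a toy model in C of a fixed-size event
ring. It is not VPP's message queue implementation, and toy_mq_t, toy_mq_push
and toy_mq_pop are made-up names. In this model, whether events get lost
depends only on how many are outstanding when the consumer falls behind, not
on how large any single response is.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy fixed-capacity event ring; NOT VPP's message queue.
 * All names here are hypothetical and exist only for illustration. */
#define TOY_MQ_CAPACITY 4

typedef struct
{
  uint32_t events[TOY_MQ_CAPACITY]; /* e.g. handles of sessions with events */
  size_t head;                      /* index of the oldest queued event */
  size_t count;                     /* number of events currently queued */
  size_t dropped;                   /* pushes rejected because the ring was full */
} toy_mq_t;

/* Producer side: returns 0 on success, -1 if the ring is full.
 * In this toy model a rejected event is simply counted as lost. */
int
toy_mq_push (toy_mq_t *mq, uint32_t ev)
{
  if (mq->count == TOY_MQ_CAPACITY)
    {
      mq->dropped++;
      return -1;
    }
  mq->events[(mq->head + mq->count) % TOY_MQ_CAPACITY] = ev;
  mq->count++;
  return 0;
}

/* Consumer side: stores the oldest event in *ev and returns 0,
 * or returns -1 if the ring is empty. */
int
toy_mq_pop (toy_mq_t *mq, uint32_t *ev)
{
  if (mq->count == 0)
    return -1;
  *ev = mq->events[mq->head];
  mq->head = (mq->head + 1) % TOY_MQ_CAPACITY;
  mq->count--;
  return 0;
}
```

Pushing six events into a zero-initialized toy_mq_t without popping leaves four
queued and two counted as dropped; popping one frees a slot for the next push.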

> Are there known VCL epoll delivery latency issues when the event queue is 
> near capacity?

Not as far as I know. 

> Would increasing rx-fifo-size help buffer bursty response data while waiting 
> for epoll events?

Regarding mq issues, no. In general, larger fifos do help with networking
latency, but from an event delivery perspective they should not directly
matter.

> Issue 2: epoll_wait() returns no events without timeout (Nginx crashes)
> 
> Symptom: At least one Nginx instance crashes with:
> [alert] PID#0: epoll_wait() returned no events without timeout
> ldp_destructor:2913: LDP<PID>: LDP destructor: done!
> This occurs under moderate load (estimated ~2000 QPS) after variable uptime 
> (minutes to hours). Both Nginx instances are affected, and the crash can 
> cascade when one side goes down.
> This happens with both use-mq-eventfd enabled and disabled across different 
> test iterations.
> Questions:
> Is this a known race condition in VCL's epoll simulation layer?

Not known, but this sounds like the timeout might not have been respected. I
think we had changes in that area, but it’s hard to say whether they solved
this issue. For instance, we now handle EINTR.

> Are there recommended VCL parameters (event-queue-size, session-timeout, 
> app-timeout) to mitigate this?
> Could the number of active sessions (TIME-WAIT from short-lived proxy 
> connections) correlate with this failure?

I think these have to do with how vppcom_epoll_wait_eventfd handles the Linux
epoll timeout, not with configuration.

> Issue 3: CLOSE-WAIT accumulation from missing EPOLLRDHUP
> 
> Symptom: vppctl show tcp stats shows thousands of timeout close-wait events, 
> with corresponding segments retransmitted in the hundreds of thousands. After 
> disabling upstream keepalive in Nginx (forcing short-lived connections), 
> close-wait timeouts dropped to near zero, but Nginx still experienced 
> epoll_wait failures (Issue 2).
> Root cause appears to be: VCL does not reliably deliver EPOLLRDHUP to Nginx's 
> epoll queue when the remote peer sends FIN, causing Nginx to reuse dead 
> connections from the keepalive pool and VPP to accumulate CLOSE-WAIT sessions 
> that eventually timeout.
> Questions:
> Is EPOLLRDHUP support in VCL considered production-ready?

Yes, it’s production ready from my perspective. If the issue is reproducible
with master, we should fix it.

> What is the recommended approach for Nginx + VCL with upstream keepalive?

It should just work. Could you try reproducing the issue with the hs-test
nginx tests?

> Issue 4: optname 31 (TCP_ULP / kTLS) unsupported in VCL setsockopt
> 
> Symptom: VCL debug log flooded with:
> ERROR: fd XX: setsockopt() SOL_TCP: vlsh X optname 31 unsupported!
> This is OpenSSL 3.x attempting setsockopt(fd, SOL_TCP, TCP_ULP, "tls", 3) to 
> enable kernel TLS (kTLS). VCL's setsockopt implementation has case statements 
> for TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_CONGESTION, and TCP_CORK, but not 
> TCP_ULP (31), so it falls to the default branch which logs an error.
> While setsockopt(TCP_ULP) failure is non-fatal (OpenSSL falls back to 
> user-space TLS), the high-frequency debug logging under load contributes to 
> unnecessary I/O. The fix would be to add case TCP_ULP: return -EOPNOTSUPP; 
> without the debug log.

Feel free to push the fix, or let us know if somebody else should do it. 
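
For whoever picks it up, a self-contained sketch of the shape of the change
described above. It is a stand-in, not the actual LDP source:
toy_sol_tcp_setsockopt is a made-up name, the handled cases are just the four
options listed in the report, and the real code's return and errno conventions
may differ.

```c
#include <errno.h>
#include <netinet/tcp.h>
#include <stdio.h>

#ifndef TCP_ULP
#define TCP_ULP 31 /* value from the log in the report; older headers lack it */
#endif

/* Hypothetical stand-in for the SOL_TCP branch of an LD_PRELOAD setsockopt
 * shim; NOT the actual VCL/LDP source. Returns 0 if the option is handled,
 * or a negative errno value if it is not. */
int
toy_sol_tcp_setsockopt (int fd, int optname)
{
  switch (optname)
    {
    case TCP_KEEPIDLE:
    case TCP_KEEPINTVL:
    case TCP_CONGESTION:
    case TCP_CORK:
      return 0; /* placeholder: the real code handles these options itself */
    case TCP_ULP:
      /* kTLS probe from OpenSSL: reject quietly so the caller falls back
       * to user-space TLS without one log line per connection. */
      return -EOPNOTSUPP;
    default:
      /* Anything else is still reported, as before. */
      fprintf (stderr, "fd %d: setsockopt() SOL_TCP: optname %d unsupported\n",
               fd, optname);
      return -EOPNOTSUPP;
    }
}
```

The caller sees the same failure for TCP_ULP as before; only the log line is
gone.
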
> Environment
> 
> VPP: 23.06 custom build with SSLO plugin
> Nginx: 1.28.0 (custom build with setsockopt SO_VCL_SET_PEER_INFO patch)
> OpenSSL: 3.x (linked with Nginx)
> OS: Ubuntu 22.04 LTS
> Kernel: 5.10.134
> Summary
> 
> The common thread across all issues is VCL's epoll event delivery reliability 
> under load. Data arrives at VPP's TCP layer correctly (verified by vppctl 
> pcap trace showing clean ACK sequences and no retransmissions on the VPP 
> side), but the events are either delayed (Issue 1), lost entirely (Issue 2), 
> or missing specific flags (Issue 3) when delivered to the application.
> Are these known issues with specific VPP versions? Are there configuration 
> guidelines for tuning VCL for high-throughput Nginx reverse proxy workloads?
> Thank you for any guidance.

Hope the above helps. Let us know if things improve once you modify the
configs.

Regards, 
Florin


