During a recent DNS outage, TCP connections from HAProxy to our DNS resolvers spiked. Even after DNS recovered, connections stayed at ~3,000 (normal is ~2). Although dns_process_idle_exp should clean up idle sessions, it did not in this state. tcpdump showed the DNS server closing idle connections after ~30s, but HAProxy immediately reopened them. Restarting HAProxy was required to return to normal connection counts. Setting maxconn on the resolvers (pulling in this patch<https://github.com/haproxy/haproxy/commit/5288b39011b2449bfa896f7932c7702b5a85ee77> could also be tested) mitigates the spike, but not the post-recovery persistence.
Environment

HAProxy version: HAProxy version 2.9.4-9839cb-6 2024/07/31 - https://haproxy.org/
Status: stable branch - will stop receiving fixes around Q1 2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.9.4.html
Running on: Linux 5.15.173.1-2.cm2 #1 SMP Fri Feb 7 02:18:38 UTC 2025 x86_64

HAProxy config (the tcp@X.X.X.X addresses below are redacted placeholders):

    defaults
        mode http
        timeout client 10s
        timeout connect 5s
        timeout server 10s
        timeout http-request 10s

    resolvers systemdns
        nameserver dns1 tcp@X.X.X.X:53
        nameserver dns2 tcp@X.X.X.Y:53
        parse-resolv-conf
        timeout resolve 1s

    frontend one
        bind 127.0.0.1:8002
        mode http
        use_backend myservers

    frontend stats
        bind ipv4@:8889,ipv6@:8889
        stats enable
        stats uri /stats
        stats refresh 10s

    backend myservers
        mode http
        http-request return status 200 content-type "plain/text" string GOOD

    backend T1
        balance roundrobin
        timeout connect 500ms
        option redispatch 2
        server-template t1 1-2000 "my-domain.com:8446" ssl verify none resolvers systemdns init-addr none maxconn 2000

    backend T2
        balance roundrobin
        timeout connect 500ms
        option redispatch 2
        server-template t2 1-2000 "my-domain2.com:8446" ssl verify none resolvers systemdns init-addr none maxconn 2000

How to reproduce:

1. Start HAProxy.
2. Watch open TCP connections to the DNS servers:

       while true; do echo "$(date): $(ss -tanp | grep -E '(X.X.X.X|X.X.X.Y)' | wc -l) connections"; sleep 3; done

   There should be only 2 open TCP connections (1 per resolver).
3. Inject/simulate a DNS failure:

       sudo iptables -I OUTPUT -p tcp -d X.X.X.X --dport 53 -m state --state ESTABLISHED -m length --length 52: -j DROP

   Observe that TCP connections increase.
4. Undo the failure:

       sudo iptables -D OUTPUT -p tcp -d X.X.X.X --dport 53 -m state --state ESTABLISHED -m length --length 52: -j DROP

   Observe that connections stay high.

Note: The iptables fault injection may not exactly match the original outage, but it produces similar symptoms. I tried with both the version deployed during the issue and the latest, and was able to reproduce with both.
Observations

* Post-recovery, HAProxy maintains ~3k DNS TCP connections.
* The DNS server closes idle sessions after ~30s; HAProxy immediately reopens them.
* Restarting HAProxy reduces connections to normal.
* Adding maxconn on the resolvers prevents the spike, but not the sticky high count after recovery.

Hypotheses

1. dns_process_idle_exp<https://github.com/haproxy/haproxy/blob/8aef5bec1ef57eac449298823843d6cc08545745/src/dns.c#L1083> target calculation: max_active_conns<https://github.com/haproxy/haproxy/blob/8aef5bec1ef57eac449298823843d6cc08545745/src/dns.c#L1116> is reset to cur_active_conns. After the outage → recovery transition, an initial cleanup sets max_active_conns low; target<https://github.com/haproxy/haproxy/blob/8aef5bec1ef57eac449298823843d6cc08545745/src/dns.c#L1097> then computes to 0, so later idle-cleanup cycles are skipped and the excess sessions stay open.

2. Stuck "zombie" sessions: the failure mode, combined with the timeout/TCP/DNS settings, yields connections with onfly_queries == 0 and nb_queries > 1. When the DNS server or kernel kills the TCP connection and dns_session_release<https://github.com/haproxy/haproxy/blob/8aef5bec1ef57eac449298823843d6cc08545745/src/dns.c#L914C13-L914C32> runs, HAProxy recreates the session, preserving the bad state and keeping counts high.

I might be totally off here, and setting maxconn does mitigate this issue for us, but I thought I'd reach out to see if anyone else had any thoughts. Thanks!
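To make hypothesis 1 concrete, here is a deliberately simplified Python model of the suspected idle-expiration behavior. This is not the actual src/dns.c code: the function name, the "purge half of the excess" policy, and the exact bookkeeping are assumptions for illustration; only the reset of the peak to the current active count mirrors the suspected bug.

```python
# Simplified model (NOT the real dns_process_idle_exp) of the suspected
# cleanup logic. Names and the purge policy are illustrative assumptions.

def idle_cleanup_cycle(cur_active_conns, idle_conns, max_active_conns):
    """One idle-expiration pass over a resolver's TCP sessions.

    Returns (idle_conns_after, new_max_active_conns)."""
    # Purge target: some fraction of the gap between the historical peak
    # of active conns and the current active count (assumed: half).
    target = (max_active_conns - cur_active_conns) // 2
    killed = min(target, idle_conns)
    # Suspected bug: the peak is reset to the *current* active count, so
    # the next pass computes target == 0 and purges nothing, even though
    # thousands of idle sessions remain.
    return idle_conns - killed, cur_active_conns

# Post-outage state: ~3000 sessions total, almost all idle.
cur_active, idle, peak = 2, 2998, 3000
for cycle in range(3):
    idle, peak = idle_cleanup_cycle(cur_active, idle, peak)
    print(f"cycle {cycle}: idle={idle} peak={peak}")
```

Under this model, the first pass purges some idle sessions, but every subsequent pass is a no-op because the peak was clamped to 2, which would match the observed "connections stay high until restart" behavior.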

