Hi,

After upgrading from BIND 9.20.21 to 9.20.23 on Debian 13, I am seeing a large
number of TCP connections to port 853 accumulate in CLOSE_WAIT state when
forwarding queries to DoT upstream servers (tested with Cloudflare and DNS4EU).

After some time under normal load, the per-state connection counts look like
this:

$ ss -tnp | grep 853 | awk '{print $1}' | sort | uniq -c | sort -rn

4465 CLOSE-WAIT
   2 ESTAB

Connections in CLOSE_WAIT accumulate continuously across all configured DoT 
upstream servers:

$ ss -tnp | grep 853 | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn

1321 86.54.11.200 (DNS4EU)
1203 86.54.11.100 (DNS4EU)
1080 1.1.1.1      (Cloudflare)
 861 1.0.0.1      (Cloudflare)
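
Note that "grep 853" matches any line containing the digits 853, including a
local ephemeral port such as 38530, so the counts above may be slightly
inflated. A minimal sketch of a stricter count, assuming the default column
layout of "ss -tn" (peer address:port in the fifth column); the function name
count_dot_states is made up for illustration:

```shell
# Count connections per TCP state from `ss -tn` output on stdin,
# keeping only rows whose peer port is exactly 853.
# Assumes column 1 is the state and column 5 is the peer address:port.
count_dot_states() {
    awk '{
            # Split on ":" and take the last piece so IPv6 peers also work.
            n = split($5, parts, ":")
            if (parts[n] == "853") count[$1]++
         }
         END { for (s in count) printf "%d %s\n", count[s], s }' | sort -rn
}
```

Usage would be "ss -tn | count_dot_states". The header line is skipped
automatically because its fifth field does not end in ":853".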

The same error pattern appears in the logs regardless of the queried domain
or the upstream server. Observed examples:

info: shut down hung fetch while resolving 0xXXXXXXXXX000(<ext-domain>/A)
debug 1: set ede: info-code 22 extra-text (null)
debug 1: client @0xXXXXXXXXX000 <client-ip>#56707 (<ext-domain>): rpz QNAME rewrite <ext-domain> stop on unrecognized qresult in rpz_rewrite() failed: SERVFAIL
debug 1: client @0xXXXXXXXXX000 <client-ip>#56707 (<ext-domain>): query failed (SERVFAIL) for <ext-domain>/IN/A at query.c:7860
debug 2: fetch completed for <ext-domain>/A in 12.000205: SERVFAIL/success [domain:.,referral:0,restart:1,qrysent:1,timeout:0,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
debug 3: client @0xXXXXXXXXX000 <client-ip>#56707 (<ext-domain>): send failed: operation canceled

query-errors: debug 1: client @0xXXXXXXXXX000 <client-ip>#56402 (<ext-domain>): rpz QNAME rewrite <ext-domain> stop on unrecognized qresult in rpz_rewrite() failed: SERVFAIL
query-errors: info: client @0xXXXXXXXXX000 <client-ip>#56402 (<ext-domain>): query failed (SERVFAIL) for <ext-domain>/IN/A at query.c:7860
query-errors: debug 2: fetch completed for <ext-domain>/A in 12.004205: SERVFAIL/success [domain:.,referral:0,restart:1,qrysent:0,timeout:0,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

Impact:
- Initially there are few or no SERVFAIL errors; over time their number grows
significantly. In some cases, DNS resolution becomes unusable
- Failing queries time out after almost exactly 12 seconds (12.000205 s and
12.004205 s in the logs above)
- The system accumulates thousands of stale TCP connections in CLOSE_WAIT
- The issue affects all configured DoT upstream providers simultaneously,
which makes an upstream-side cause unlikely
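
The 12-second figure can be checked against a whole log file rather than a
few samples. A minimal sketch, assuming each log entry sits on one line and
has the same "fetch completed for NAME/TYPE in SECONDS: SERVFAIL" shape as the
examples above; the function name fetch_durations and the log path are
placeholders:

```shell
# Print the duration in seconds of every fetch that ended in SERVFAIL,
# reading named log lines from stdin. Lines that do not match are ignored.
fetch_durations() {
    sed -n 's/.*fetch completed for .* in \([0-9][0-9]*\.[0-9][0-9]*\): *SERVFAIL.*/\1/p'
}
```

For example, "fetch_durations < /path/to/named.log | cut -d. -f1 | sort -n |
uniq -c" would group the failed fetches by whole seconds.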

Downgrading to 9.20.21 fully resolves the issue.

Has anyone else seen this? Is there a configuration-level workaround that 
properly closes stale TLS connections? Or is this a bug?

Thanks

Dennis
-- 
Visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from 
this list.