On Jul 17, 6:09 am, "Jason Bratton" <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> Like many of you, I recently upgraded all of our caching nameservers.
> Since we were already running BIND 9.4.2, I chose to upgrade to 9.4.2-P1.
> After the upgrade, I started receiving complaints of DNS queries that
> were truncated and then retried over TCP failing.
>
> It appears that BIND is limiting the number of open TCP connections to
> ~100 per IP address it listens on. For example, on one of our caching
> nameservers:
>
> cachens-4:~# netstat -an | grep tcp | grep 72.3.128.240 | wc -l
> 99
> cachens-4:~# netstat -an | grep tcp | grep 72.3.128.241 | wc -l
> 105
>
> From an "rndc status":
>
> tcp clients: 0/1000
>
> Almost all (~99%) of the TCP connections in the netstat output above are
> in the SYN_RECV state. My guess would be customer servers with bad
> firewall rules, but in any case that's not really relevant to this
> particular problem, because nothing has changed except the upgrade from
> 9.4.2 to 9.4.2-P1. I didn't change named.conf or anything else, and as
> you can see, tcp-clients is set to 1000.
>
> Did something change in the source code that would cause this? I'm
> thinking of a listen() call with the backlog set to 100 that wasn't set
> up that way previously. Something interesting to me is that the ARM
> specifies the default for tcp-clients as 100, but maybe that is a
> coincidence.
>
> FWIW, SOMAXCONN is set to 128 on my servers. Prior to this patch I was
> using a Debian-packaged version of 9.4.2, so maybe they had it set
> higher? I looked all through the source and the changes Debian made to
> 9.4.2 and couldn't find anything to indicate this is the case.
>
> I'm open to suggestions! This is a Debian Etch box running kernel 2.6.18
> on an x86_64 architecture. Thanks.
>
> -- Jason
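The listen() backlog theory in the quoted message can be sketched in isolation. This is a minimal illustration, not BIND's actual code (Python rather than C, and the backlog value is an assumption based on the poster's guess): the second argument to listen() bounds the per-socket queue of pending connections, and Linux silently caps it at net.core.somaxconn (128 on a stock Debian Etch box). If the P1 build started passing a backlog of ~100 where the packaged 9.4.2 passed something larger, a pile-up of half-open SYN_RECV connections could keep that queue full so that new TCP queries time out, regardless of the tcp-clients setting.

```python
import socket

# Hypothetical sketch (not BIND source): listen()'s backlog argument
# bounds the queue of connections the kernel will hold for this socket
# before accept(). The kernel silently caps the value at
# net.core.somaxconn, so requesting more than 128 on a stock Debian
# Etch box still yields a 128-entry queue.
def make_listener(host="127.0.0.1", backlog=100):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, 0))   # port 0: let the kernel pick a free port
    s.listen(backlog)   # stalled half-open connections count against this queue
    return s

if __name__ == "__main__":
    s = make_listener()
    print("listening on", s.getsockname())
    s.close()
```

If this guess is right, the symptom would track the backlog value rather than tcp-clients, which matches the ~100 connections per listening address seen in the netstat counts above.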
I am experiencing a similar issue with a vendor-supplied BIND that includes the 9.4.2-P1 fixes:

  QDDNS 4.1 Build 6 - Lucent DNS Server (BIND 9.4.1-P1), Copyright (c) 2008 Alcatel-Lucent
  + Includes security fixes from BIND 9.4.2-P1

It all started with a complaint that a query was failing on one of our 15 internal DNS servers. All 15 servers were recently deployed and identical in configuration. When I looked into the issue, I noticed that the query generated a truncated response, which was then retried over TCP. I then tested queries against the problematic server using "dig +tcp" and discovered that all TCP DNS queries were failing on this server. netstat showed lots of connections in SYN_RECV.

We had seen the same symptoms before when our firewall team misconfigured rules, so I checked whether that was the cause here. I logged on to the problematic server and issued TCP queries to itself. In doing so, I noticed something very strange: "dig +tcp somehost.domain.com @127.0.0.1" would succeed with no issues, while "dig +tcp somehost.domain.com @ip.of.the.server" would result in:

  ; <<>> DiG 9.4.1-P1 <<>> +tcp xxxx.xxxx.xxxx @xxx.xxx.xxx.xxx
  ; (1 server found)
  ;; global options: printcmd
  ;; connection timed out; no servers could be reached

I am still waiting for the vendor to accept that this is not a firewall issue, since I can reproduce it by querying the server from itself.
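A quick way to quantify the SYN_RECV pile-up described in both reports is to tally connection states per listening address instead of just counting lines. A small sketch along those lines (the column layout assumed here matches Linux "netstat -an" output; the sample data is made up for illustration):

```python
from collections import Counter

def tcp_states_by_local_ip(netstat_output, local_ip):
    """Tally TCP connection states for sockets local to local_ip.

    Assumes Linux `netstat -an` columns:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            # Local address is "ip:port"; strip the port.
            addr = fields[3].rsplit(":", 1)[0]
            if addr == local_ip:
                counts[fields[5]] += 1
    return counts

# Fabricated sample resembling the netstat output quoted above.
sample = """\
tcp 0 0 72.3.128.240:53 10.0.0.5:41234 SYN_RECV
tcp 0 0 72.3.128.240:53 10.0.0.6:51112 ESTABLISHED
tcp 0 0 72.3.128.241:53 10.0.0.7:40001 SYN_RECV
"""
print(tcp_states_by_local_ip(sample, "72.3.128.240"))
# -> Counter({'SYN_RECV': 1, 'ESTABLISHED': 1})
```

Run against real output (e.g. piped from "netstat -an"), a breakdown dominated by SYN_RECV for one address, with TCP queries to that address timing out while loopback succeeds, points at a per-listener queue limit rather than tcp-clients, which "rndc status" shows is nowhere near exhausted.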
