Re: Sporadic Timeouts after upgrading to bind9.20
> On our production name servers we have check every 30s if bind
> is alive by sending a SOA query to bind. Today I upgraded a few
> nodes from 9.18.x (x between 17 and 27) to 9.20.1 (Ubuntu 24.04
> with packages from ISC ppa).
>
> Since that, we have sporadic timeouts (3s). On the nodes with
> more qps we see it more often.
>
> Before I dig into the problem, are there any specific changes
> to 9.20 that I should look at? Maybe some default value changes
> for socket buffers, thread handling ...?

I can't answer specifically about BIND 9.20; I'm currently dipping
my toes carefully into the waters of "deploying BIND 9.20 as a
recursor".

What you don't say anything about is whether you see increased CPU
load on your hosts, and whether the relationship between QPS and CPU
load has changed after upgrading to 9.20. Also, what general level
of load do you observe on this / these host(s)? E.g. "how close to
the limit of what it can do" are you?

In our deployment, we monitor the relationship between the number of
"udp: dropped due to full socket buffers" and "udp: datagrams
received" (in our case via collectd / graphite / grafana), and when
we started doing that we found that we needed to bump the default
UDP socket buffers quite a bit to get that event rate down to an
acceptable level.

Regrettably, as far as I know, BIND does not have a knob to adjust
the socket buffer size for the UDP sockets BIND itself uses, so what
I ended up doing was bumping the default for UDP sockets on the
entire host via sysctl. In my case that's "fine" because the host is
basically only serving this single function.

Then again, I'm the weirdo running BIND on NetBSD, so the defaults
are probably widely different in your case.

Just an example from one of our publishing (non-recursive) BIND
servers, from "netstat -s" output:

udp:
        1669688117 datagrams received
        0 with incomplete header
        10 with bad data length field
        994 with bad checksum
        10922 dropped due to no socket
        874709 broadcast/multicast datagrams dropped due to no socket
        890955 dropped due to full socket buffers
        1667910527 delivered
        2741883224 PCB hash misses
        1632037948 datagrams output

which comes out to 0.05% as an overall average of "drops due to full
socket buffers", but that doesn't mean there aren't occasional
(smallish) spikes in the rate, of course. And this is with BIND
9.18.29.

In other words: I think more information is needed to help you
diagnose the issue.

Regards,

- Håvard
-- 
Visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

ISC funds the development of this software with paid support subscriptions. Contact us at https://www.isc.org/contact/ for more information.

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
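
As a rough illustration of the ratio discussed in the message above, the following sketch parses the udp: section of "netstat -s" output and divides "dropped due to full socket buffers" by "datagrams received". The function name udp_drop_ratio and the SAMPLE constant are made up for this example; the counter labels are the ones shown in the NetBSD output above, and other systems will most likely word them differently.

```python
import re

# Counters copied from the "netstat -s" excerpt in the message above.
SAMPLE = """udp:
        1669688117 datagrams received
        0 with incomplete header
        10 with bad data length field
        994 with bad checksum
        10922 dropped due to no socket
        874709 broadcast/multicast datagrams dropped due to no socket
        890955 dropped due to full socket buffers
        1667910527 delivered
        2741883224 PCB hash misses
        1632037948 datagrams output
"""


def udp_drop_ratio(netstat_text: str) -> float:
    """Return full-socket-buffer drops divided by datagrams received.

    Hypothetical helper: assumes NetBSD-style labels and an unindented
    "udp:" header followed by indented "<count> <label>" lines.
    """
    counters = {}
    in_udp = False
    for line in netstat_text.splitlines():
        stripped = line.strip()
        # An unindented line ending in ":" starts a new protocol section.
        if not line[:1].isspace() and stripped.endswith(":"):
            in_udp = stripped == "udp:"
            continue
        if in_udp:
            match = re.match(r"(\d+)\s+(.*)", stripped)
            if match:
                counters[match.group(2)] = int(match.group(1))
    received = counters["datagrams received"]
    dropped = counters["dropped due to full socket buffers"]
    return dropped / received if received else 0.0


if __name__ == "__main__":
    print(f"{udp_drop_ratio(SAMPLE):.2%}")
```

Note that this is a lifetime average since boot; to see the short spikes mentioned above one would have to sample the counters periodically and compare the deltas, which is what the collectd / graphite setup does.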
Re: BIND statistics
> On Mon, Aug 26, 2024 at 06:05:19PM +0200, Havard Eidnes via bind-users wrote:
>> Thanks. I found it, and it's more than a little embarrassing.
>>
>> This is what you get when not building with --with-libxml2: an
>> "un-rendered" xsl file as a result, in essence just the content
>> of bin/named/xsl.c. And this happened because I wasn't paying
>> attention to what options were turned on by default for the
>> package I was putting together. "Surely stats is on by default!"
>> Not so. (Well, I didn't even think it was optional.) Lesson
>> learned.
>
> It *is* on by default, if it can find libxml2. Does yours live in
> a nonstandard location?

Time for more confessions.

This is in NetBSD's pkgsrc, which only builds with explicitly
"buildlinked" libraries, so that build dependencies are explicitly
declared, and not automatically picked up from those you just
accidentally happen to have installed on the build host.

What I had overlooked was that I needed the following in
/etc/mk.conf:

PKG_OPTIONS.bind+=      bind-xml-statistics-server

It's another matter whether this one should default to "on" in the
package itself -- I'm leaning in that direction, but need to discuss
with some others before I change the default. And I also need the
"dnstap" option in my deployment, so I need a custom build anyway.

Like I said, "lesson learned".

> Perhaps, if libxml2 and libjson-c are both unavailable, we should
> disable statistics-channels in the configuration - at least that way
> the problem would've been easier to figure out.

Right, I was sort of thinking in that direction as well, but would
not be too insistent on something along those lines. Perhaps return
a web page saying "built without both libjson-c and libxml2, so
nothing to see here"?

Regards,

- Håvard
Re: BIND statistics
> If I was debugging this I would:
> - compared strace output from working and non-working server

I did parts of that, ref. that other message I sent.

> Unfortunately you are the only person who reported this problem and I
> can't reproduce it either, so it's probably up to you to find needle
> in the haystack. Good luck!

Thanks. I found it, and it's more than a little embarrassing.

This is what you get when not building with --with-libxml2: an
"un-rendered" xsl file as a result, in essence just the content
of bin/named/xsl.c. And this happened because I wasn't paying
attention to what options were turned on by default for the
package I was putting together. "Surely stats is on by default!"
Not so. (Well, I didn't even think it was optional.) Lesson
learned.

Regards,

- Håvard
Re: BIND statistics
BTW, I got an off-line question about how the chrooting is done in
my case, i.e. whether the "chroot" program is used, or the "-t"
option to BIND is used. In my case it's the latter:

   -t directory
       This option tells named to chroot to directory after
       processing the command-line arguments, but before reading
       the configuration file.

Regards,

- Håvard
Re: BIND statistics
Hi,

and thanks for the suggestions.

This is not an issue of broken clocks; all the involved machines run
ntp and have good sync status traceable to at least a GPS clock.

This does however appear to have something to do with the
chroot'edness of this particular installation, and it's evident that
"something is missing" in the chroot, and that this "something" is a
run-time dependency of some sort.

I have one installation of 9.20.0 which doesn't run in a chroot, and
there the stats are rendering properly in my firefox browser (there's
some oddity with the graphics display in Chrome, will bring that up
separately).

Ktracing the start of the response to the statistics reports reveals
a marked difference:

The chroot'ed system's first few lines of output:

 12931 12931 named GIO fd 1028 wrote 4088 bytes
   "HTTP/1.1 200 OK\r\nContent-Type: text/xslt+xml\r\nDate: Mon, 26 Aug 20\
    24 08:05:10 GMT\r\nExpires: Mon, 26 Aug 2024 08:05:10 GMT\r\nLast-Modi\
    fied: Sat, 24 Aug 2024 19:22:20 GMT\r\nCache-Control: public\r\nServer\
    : libisc\r\nContent-Length: 39276\r\n\r\n\n\n\nhttp://www.w3.org/1999/xhtml\
    \" version=\"1.0\">\n \n \n \n \
    \n<\

while the non-chroot'ed system outputs:

 861861 named GIO fd 35 wrote 4088 bytes
   "HTTP/1.1 200 OK\r\nContent-Type: text/xml\r\nDate: Mon, 26 Aug 2024 08\
    :15:10 GMT\r\nExpires: Mon, 26 Aug 2024 08:15:10 GMT\r\nLast-Modified:\
    Mon, 26 Aug 2024 08:15:10 GMT\r\nPragma: no-cache\r\nCache-Control: n\
    o-cache\r\nServer: libisc\r\nContent-Length: 38449\r\n\r\n\n\n20\
    24-08-16T17:12:39.761Z2024-08-26T07:44:26.863\
    Z2024-08-26T08:15:10.620Z9.20.04534240<\
    /counter>0Traffic Size\n\nServer Status\n\n \
    \nBoot time:\n \
    \n \n \
    \n \n \n \
    Last reconfigured:\n\n \n\n \
    \n \nCurrent time:<\
    /th>\n\n \n\n \n \nServer version:\n\n \
    \n\n \n\n\n
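
The visible difference between the two traces above is the Content-Type header of the response. A minimal sketch of checking that from a client, without ktrace, could look like the following. The helper names describe_stats_response and fetch_headers, and the URL in the usage comment, are made up for this example; the two content types are simply the ones seen in the traces, not a statement about everything a statistics channel can return.

```python
from urllib.request import urlopen


def describe_stats_response(headers: dict) -> str:
    """Classify a statistics-channel reply by its Content-Type header.

    Hypothetical helper: only knows the two values seen in the traces.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    content_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    if content_type == "text/xslt+xml":
        return "stylesheet only (as in the chroot'ed trace)"
    if content_type == "text/xml":
        return "statistics XML (as in the non-chroot'ed trace)"
    return f"unexpected content type: {content_type or 'none'}"


def fetch_headers(url: str, timeout: float = 5.0) -> dict:
    """Fetch the URL and return its response headers as a plain dict."""
    with urlopen(url, timeout=timeout) as response:
        return dict(response.headers.items())


# Usage, assuming a statistics channel is reachable at this address
# (the address is a placeholder, not taken from any real setup):
#   print(describe_stats_response(fetch_headers("http://127.0.0.1:8053/")))
```

Running this against both the working and the non-working server gives a quick yes/no answer on whether the root URL is handing out the statistics document or just the stylesheet.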
Re: BIND statistics
>> Hi Håvard.
>> Have you tried a different browser?
>
> Not yet. Will do tomorrow.

Latest Chrome on MacOS: just the same; it displays the raw XML,
which isn't exactly user-friendly.

Regards,

- Håvard
Re: BIND statistics
Looking a bit further, I find in the XML output:

    Server Status
    Boot time:

So no actual value? Is there a required post-processing step which
is omitted? I see xsl is mentioned both here and in the style
definition at the start of the XML output. I am however way too
unfamiliar with the various XML-related tools to tell which piece is
either missing or malfunctioning.

This particular name server instance is running in a chroot, so
naturally no external xsl processor is available (but surely BIND
doesn't do it that way). However, I don't find any "stray"
references to XSLTPROC in the code, so if that transformation is
supposed to be done in some way, it must be done by some other
method.

My libxml2 is version 2.12.8, and is accepted by configure.

Regards,

- Håvard
Re: BIND statistics
> Hi Håvard.
> Have you tried a different browser?

Not yet. Will do tomorrow.

> Having said that, I just started 9.20.0 with this config:
>
> statistics-channels { inet 127.0.1.0 port 8080 ; };
>
> Then pointed three different browsers at that address/port and it looks
> fine to me in all of them.
> Browsers tried were Chrome, Safari and Firefox.
>
> I can't reproduce your issue, sorry.

OK, thanks for checking anyway, will do more testing.

Regards,

- Håvard
BIND statistics
Hi,

I'm mostly running BIND 9.18.x, and have configured statistics
publishing via

statistics-channels {
        inet 127.0.0.1 port 8053 allow { 127.0.0.1; };
        inet "actual-address" port 8053 allow { prefix1/24; prefix2/24; };
};

I've started testing 9.20.x. I see BIND 9.20.x stats publishing
is ... different.

If I use firefox and visit http://actual-address:8053/ with BIND
9.18.x, I get a reasonably rendered HTML display which is easy to
view.

Not so for BIND 9.20.x; I get an XML document which firefox (in this
particular case version 120.0) informs me at the top

    This XML file does not appear to have any style information
    associated with it. The document tree is shown below.

and the document starts with

    https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"/>