Re: Strange recursor response time pattern
> It would be difficult, and possibly impossible, to continue to > process queries and format a report on queries simultaneously > without losing information in the report. To have a separate > thread creating the report, it might have to stop query > processing, take a snapshot of the data at that point in time, > save it somewhere, restart query processing, and then format > the report from the saved data. In this case, there would be a > brief interval when name could not handle queries. > > One might have to write a prototype to determine how long that > interruption would take. Possibly, yes. I can't presently answer to the internal workings of BIND to give an educated guess as to how much time the "take snapshot of stats raw data" function would take, and as you say that might need to be prototyped. However, it can hardly be worse than what is there today, where the stats processing thread gets queries queued to it while it's busy doing all the stats processing, and meanwhile not answering any of the queued queries. I'll note that unbound, when subjected to execution of "unbound-control stats" every 10 seconds does not behave the way BIND does. Admittedly, the number of stats values you get from unbound is far, far smaller than what we get from BIND. Regards, - Håvard ___ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users
RE: Strange recursor response time pattern
It would be difficult, and possibly impossible, to continue to process queries and format a report on queries simultaneously without losing information in the report. To have a separate thread creating the report, it might have to stop query processing, take a snapshot of the data at that point in time, save it somewhere, restart query processing, and then format the report from the saved data. In this case, there would be a brief interval when name could not handle queries. One might have to write a prototype to determine how long that interruption would take. Charles Elliott -Original Message- From: bind-users [mailto:bind-users-boun...@lists.isc.org] On Behalf Of Havard Eidnes Sent: Wednesday, September 6, 2017 8:40 AM To: m...@conundrum.com Cc: bind-us...@isc.org Subject: Re: Strange recursor response time pattern >> Is that pulling the old-style stats file, or the HTTP-based stats channel? As should be evident from my other message, this is using the HTTP-based stats channel. > If the latter... the zone list (and by extension the root > document) seems to take a long time to process, and involves some sort > of locking that blocks all query processing while the list is being > generated. We encountered this on a 3+ million zone instance.. BIND > would stop answering queries for several minutes if anyone requested > the root stats document or the zone list. Since this name server is approximately a pure recursive resolver, the list of authoritative zones is short, in fact only 3 configured zones ("localhost", "127.in-addr.arpa" and the corresponding for IPv6 loopback), and then there's the "automatic" zones in addition, but still, the halting of query processing while the list of zones is processed should not be an issue here. That said, I'm also rather baffled that BIND would have to stop processing all queries while traversing the zone instances; that certainly seems to have an excessive effect on normal operations. > As Ray says, you may be better off individually querying each of the > other documents and processing those rather than polling the root doc > to get them all in one shot. It's not "me" who is doing the querying, it's the collectd software. In the syscall trace, I see indeed that it is asking for the root document: GET / HTTP/1.1 Host: localhost:8053 User-Agent: collectd/5.7.2 Accept: */* However, your advice to query the separate documents in individual requests would: * require a rewrite of the BIND module in collectd * still not entirely get rid of the problem that some queries are put on hold while the stats channel data is processed and sent Looking at the system call trace shows me that other BIND threads do process DNS queries while this single thread which does the HTTP handling does not. Hence my suggestion to instead use a dedicated thread for the stats / HTTP handling. Oh, BTW, it also seems that BIND in my case wastes 15ms doing needless getsockname() syscalls on FD's which are invalid as part of the early stages of stats processing: 5645 17 named1504698577.991440645 CALL getsockname(0x,0x7f7fef1f06e0,0x7f7fef1f069c) 5645 17 named1504698577.991446511 RET getsockname -1 errno 9 Bad file descriptor (repeated lots of times). Regards, - Håvard ___ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users ___ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users
Re: Strange recursor response time pattern
>> Is that pulling the old-style stats file, or the HTTP-based stats channel? As should be evident from my other message, this is using the HTTP-based stats channel. > If the latter... the zone list (and by extension the root > document) seems to take a long time to process, and involves > some sort of locking that blocks all query processing while the > list is being generated. We encountered this on a 3+ million > zone instance.. BIND would stop answering queries for several > minutes if anyone requested the root stats document or the zone > list. Since this name server is approximately a pure recursive resolver, the list of authoritative zones is short, in fact only 3 configured zones ("localhost", "127.in-addr.arpa" and the corresponding for IPv6 loopback), and then there's the "automatic" zones in addition, but still, the halting of query processing while the list of zones is processed should not be an issue here. That said, I'm also rather baffled that BIND would have to stop processing all queries while traversing the zone instances; that certainly seems to have an excessive effect on normal operations. > As Ray says, you may be better off individually querying each > of the other documents and processing those rather than polling > the root doc to get them all in one shot. It's not "me" who is doing the querying, it's the collectd software. In the syscall trace, I see indeed that it is asking for the root document: GET / HTTP/1.1 Host: localhost:8053 User-Agent: collectd/5.7.2 Accept: */* However, your advice to query the separate documents in individual requests would: * require a rewrite of the BIND module in collectd * still not entirely get rid of the problem that some queries are put on hold while the stats channel data is processed and sent Looking at the system call trace shows me that other BIND threads do process DNS queries while this single thread which does the HTTP handling does not. Hence my suggestion to instead use a dedicated thread for the stats / HTTP handling. Oh, BTW, it also seems that BIND in my case wastes 15ms doing needless getsockname() syscalls on FD's which are invalid as part of the early stages of stats processing: 5645 17 named1504698577.991440645 CALL getsockname(0x,0x7f7fef1f06e0,0x7f7fef1f069c) 5645 17 named1504698577.991446511 RET getsockname -1 errno 9 Bad file descriptor (repeated lots of times). Regards, - Håvard ___ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users
Re: Strange recursor response time pattern
>> The stats channel output relating to running tasks and memory contexts >> is very extensive. > > Either way I would not have expected use of the statistics > channel to negatively impact the query performance. Is the query > channel processed with "no-delay", so that a thread doesn't get > stuck waiting for data to drain from the other end? I've done some further observations, by doing a system call trace on named when it processes a HTTP request on the stats channel. In the sample I've looked at, BIND uses ~258ms between receiving the HTTP request until it's finished writing the statistics data over HTTP. Most of the time (236ms) is consumed between receiving the request and the start of pushing the result over HTTP. The thread which writes the statistics data does not process any DNS client requests between receiving the HTTP request and finishing sending the HTTP result. It looks like the thread has filled the TCP send window around 12 times while sending the HTTP result, and the OS puts the thread in the "lwp_park" state around 1-2ms each time this happens. Shortly after the stats data has been sent, the very same thread proceeds to process DNS client requests. I'm guessing that DNS client queries has continued to be dispatched to the thread while it's processing the HTTP request. Perhaps the HTTP / stats processing ought to have been done in a separate thread which isn't also involved in DNS client request handling? (No, I do not really know how easy or difficult that is to make happen...) Best regards, - Håvard ___ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users
Re: Strange recursor response time pattern
>> some further local discussion has made me aware that us running >> "collectd" for monitoring BIND may be contributing to the >> problem; collectd fetches data each 10s by using the BIND- >> configured statistics-channel, thus BIND is processing a TCP >> connection to deliver the statistics data. >> >> It's still somewhat surprising and disappointing that this should >> interfere this much with DNS query processing... > > There are various URLs (see the BIND 9 ARM) that provide a subset of the > full statistics. > > The stats channel output relating to running tasks and memory contexts > is very extensive. Either way I would not have expected use of the statistics channel to negatively impact the query performance. Is the query channel processed with "no-delay", so that a thread doesn't get stuck waiting for data to drain from the other end? > If collectd doesn't need the full set, you may be able to ask for just > the traffic-volume related subset(s). I'm using the "bind" plugin in collectd in its default configuration. The code for the BIND plugin of collectd is at https://github.com/collectd/collectd/blob/master/src/bind.c ...and it's not me who's written it. We use grafana as the display frontend, and the choice of what data to graph is made through that interface. Therefore it may not be as easy to restrict what data is fetched via the stats channel. I'm running BIND on a 16-core AMD Opteron 6274 processor, and BIND is running a thread on each core. The machine has 16G memory, and is not starved in that department either. I'm therefore still rather surprised about the impact of using the stats channel. The other machines in my test are similarly or better spec'ed, Xeon E5-2640 (16 cores), 32G memory, and they are also not resource-starved. Regards, - Håvard ___ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users
Re: Strange recursor response time pattern
On 5 September 2017 at 11:56, Havard Eidneswrote: > Hmm... > > some further local discussion has made me aware that us running > "collectd" for monitoring BIND may be contributing to the > problem; collectd fetches data each 10s by using the BIND- > configured statistics-channel, thus BIND is processing a TCP > connection to deliver the statistics data. > > It's still somewhat surprising and disappointing that this should > interfere this much with DNS query processing... > > Is that pulling the old-style stats file, or the HTTP-based stats channel? If the latter... the zone list (and by extension the root document) seems to take a long time to process, and involves some sort of locking that blocks all query processing while the list is being generated. We encountered this on a 3+ million zone instance.. BIND would stop answering queries for several minutes if anyone requested the root stats document or the zone list. As Ray says, you may be better off individually querying each of the other documents and processing those rather than polling the root doc to get them all in one shot. ___ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users
Re: Strange recursor response time pattern
On 05/09/2017 16:56, Havard Eidnes wrote: > Hmm... > > some further local discussion has made me aware that us running > "collectd" for monitoring BIND may be contributing to the > problem; collectd fetches data each 10s by using the BIND- > configured statistics-channel, thus BIND is processing a TCP > connection to deliver the statistics data. > > It's still somewhat surprising and disappointing that this should > interfere this much with DNS query processing... There are various URLs (see the BIND 9 ARM) that provide a subset of the full statistics. The stats channel output relating to running tasks and memory contexts is very extensive. If collectd doesn't need the full set, you may be able to ask for just the traffic-volume related subset(s). Ray ___ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users
Re: Strange recursor response time pattern
Hmm... some further local discussion has made me aware that us running "collectd" for monitoring BIND may be contributing to the problem; collectd fetches data each 10s by using the BIND- configured statistics-channel, thus BIND is processing a TCP connection to deliver the statistics data. It's still somewhat surprising and disappointing that this should interfere this much with DNS query processing... Best regards, - Håvard ___ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users
Re: Strange recursor response time pattern
On 05.09.17 16:44, Havard Eidnes wrote: It seems that every 10s, "on the clock", BIND will temporarily increase the query response time rather drastically for a short while, only to settle down to normal behaviour until the next 10s event. Any idea what might be causing this? Anything I can do to remedy the behaviour? BIND keeps track of servers it's sending queries to and their RTTs to prefer sending queries to servers that respond more quickly. ocasionally, it re-sends queries even to slower or unreachable servers to see if the situation changed. what you see may be result of this algorithm. -- Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/ Warning: I wish NOT to receive e-mail advertising to this address. Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu. I just got lost in thought. It was unfamiliar territory. ___ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users
Strange recursor response time pattern
Hi, one of my users made me aware of this rather unexpected behaviour in our recursors running BIND 9.10.5-P3 on NetBSD/amd64: It seems that every 10s, "on the clock", BIND will temporarily increase the query response time rather drastically for a short while, only to settle down to normal behaviour until the next 10s event. A test I did to replicate the result was to run a shell loop with a small sub-second sleep (yes, nonstandard): while sleep .1; do date +%s dig @$recursor $query A +noadd +noauth done > /tmp/out Do this with a query for a name which exists and is in-cache. Let it run for a while. Then do: awk '/Query time:/ { qt=$4 } /^1/ { print $1 " " qt }' /tmp/out > /tmp/qt Plot the result and see a rather wide variation in response time. Next do awk '{ print $1 % 10 " " $2 }' /tmp/qt > /tmp/bins and put the result through a small perl script (attached) to give some basic statistics, and in my case I see: Ixmean stddev median min max 0 0.083303 0.555366 0.00 0 16 1 0.071531 0.353116 0.00 0 11 2 0.130875 3.170423 0.00 0 166 3 0.075922 0.467403 0.00 0 18 4 0.167146 3.928216 0.00 0 167 5 0.069928 0.357949 0.00 0 11 6 0.148990 3.536247 0.00 0 183 7 0.090478 1.786360 0.00 0 92 8 47.902799 109.998464 0.00 0 479 9 0.080216 0.373228 0.00 0 12 This I find rather surprising! This measurement is based on some 27k samples, measured against a BIND which processes 6-800 qps, so not overly busy, but similar behaviour is also observed with BINDs which process only 50-150 qps, so the query load does not appear to affect this significantly -- the result for the two other BIND installations I've looked at were (based on 24k and 20k samples): Ixmean stddev median min max 0 15.031710 0.736819 15.00 14 46 1 15.041801 1.359741 15.00 14 77 2 15.003519 0.248714 15.00 14 20 3 15.008326 0.258005 15.00 14 19 4 15.018990 0.447535 15.00 14 30 5 15.007415 0.289008 15.00 14 20 6 31.504860 42.598276 15.00 14 307 7 15.128158 3.613397 15.00 14 133 8 15.055274 2.329867 15.00 14 123 9 15.011628 0.305681 15.00 14 20 and Ixmean stddev median min max 0 9.015588 2.394387 9.00 8 121 1 8.994696 2.334736 9.00 8 124 2 8.966915 0.985907 9.00 8 55 3 8.998417 2.259316 9.00 8 122 4 24.782942 42.980507 9.00 8 265 5 8.961137 0.363112 9.00 8 17 6 8.947259 0.225413 9.00 8 10 7 9.048827 3.174344 9.00 8 128 8 9.004134 2.429312 9.00 8 128 9 9.010314 2.294347 9.00 8 121 Any idea what might be causing this? Anything I can do to remedy the behaviour? Best regards, - Håvard #! /usr/bin/perl use strict; use Statistics::Descriptive; our(@d); while(<>) { chomp; @_ = split; my($i) = $_[0]; my($v) = $_[1]; if (!defined($d[$i])) { $d[$i] = (); } push(@{ $d[$i] }, $v); } printf("Ixmean stddev median min max\n"); foreach my $i (0 .. $#d) { my $stat = Statistics::Descriptive::Full->new(); $stat->add_data(@{ $d[$i] }); my $mean = $stat->mean; my $stddev = $stat->standard_deviation; my $median = $stat->median; my $min = $stat->min; my $max = $stat->max; printf("%d %f %f %f %d %d\n", $i, $mean, $stddev, $median, $min, $max); } ___ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users