Re: Strange recursor response time pattern

2017-09-07 Thread Havard Eidnes
> It would be difficult, and possibly impossible, to continue to
> process queries and format a report on queries simultaneously
> without losing information in the report.  To have a separate
> thread creating the report, it might have to stop query
> processing, take a snapshot of the data at that point in time,
> save it somewhere, restart query processing, and then format
> the report from the saved data.  In this case, there would be a
> brief interval when name could not handle queries.
>
> One might have to write a prototype to determine how long that
> interruption would take.

Possibly, yes.  I can't presently answer to the internal workings of
BIND to give an educated guess as to how much time the "take
snapshot of stats raw data" function would take, and as you say that
might need to be prototyped.  However, it can hardly be worse than
what is there today, where the stats processing thread gets queries
queued to it while it's busy doing all the stats processing, and
meanwhile not answering any of the queued queries.

I'll note that unbound, when subjected to execution of
"unbound-control stats" every 10 seconds does not behave the way
BIND does.  Admittedly, the number of stats values you get from
unbound is far, far smaller than what we get from BIND.

Regards,

- Håvard
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


RE: Strange recursor response time pattern

2017-09-07 Thread Charles Elliott
It would be difficult, and possibly impossible, to continue to process
queries and format a report on queries simultaneously without losing
information in the report.  To have a separate thread creating the report,
it might have to stop query processing, take a snapshot of the data at that
point in time, save it somewhere, restart query processing, and then format
the report from the saved data.  In this case, there would be a brief
interval when name could not handle queries.  One might have to write a
prototype to determine how long that interruption would take.

Charles Elliott

-Original Message-
From: bind-users [mailto:bind-users-boun...@lists.isc.org] On Behalf Of
Havard Eidnes
Sent: Wednesday, September 6, 2017 8:40 AM
To: m...@conundrum.com
Cc: bind-us...@isc.org
Subject: Re: Strange recursor response time pattern

>> Is that pulling the old-style stats file, or the HTTP-based stats
channel?

As should be evident from my other message, this is using the HTTP-based
stats channel.

> If the latter... the zone list (and by extension the root
> document) seems to take a long time to process, and involves some sort 
> of locking that blocks all query processing while the list is being 
> generated.  We encountered this on a 3+ million zone instance.. BIND 
> would stop answering queries for several minutes if anyone requested 
> the root stats document or the zone list.

Since this name server is approximately a pure recursive resolver, the list
of authoritative zones is short, in fact only
3 configured zones ("localhost", "127.in-addr.arpa" and the corresponding
for IPv6 loopback), and then there's the "automatic" zones in addition, but
still, the halting of query processing while the list of zones is processed
should not be an issue here.

That said, I'm also rather baffled that BIND would have to stop processing
all queries while traversing the zone instances; that certainly seems to
have an excessive effect on normal operations.

> As Ray says, you may be better off individually querying each of the 
> other documents and processing those rather than polling the root doc 
> to get them all in one shot.

It's not "me" who is doing the querying, it's the collectd software.  In the
syscall trace, I see indeed that it is asking for the root document:

GET / HTTP/1.1
Host: localhost:8053
User-Agent: collectd/5.7.2
Accept: */*

However, your advice to query the separate documents in individual requests
would:

 * require a rewrite of the BIND module in collectd
 * still not entirely get rid of the problem that some queries
   are put on hold while the stats channel data is processed and
   sent

Looking at the system call trace shows me that other BIND threads do process
DNS queries while this single thread which does the HTTP handling does not.
Hence my suggestion to instead use a dedicated thread for the stats / HTTP
handling.

Oh, BTW, it also seems that BIND in my case wastes 15ms doing needless
getsockname() syscalls on FD's which are invalid as part of the early stages
of stats processing:

  5645 17 named1504698577.991440645 CALL
getsockname(0x,0x7f7fef1f06e0,0x7f7fef1f069c)
  5645 17 named1504698577.991446511 RET   getsockname -1 errno 9 Bad
file descriptor

(repeated lots of times).

Regards,

- Håvard
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to
unsubscribe from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users

___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Strange recursor response time pattern

2017-09-06 Thread Havard Eidnes
>> Is that pulling the old-style stats file, or the HTTP-based stats channel?

As should be evident from my other message, this is using the
HTTP-based stats channel.

> If the latter... the zone list (and by extension the root
> document) seems to take a long time to process, and involves
> some sort of locking that blocks all query processing while the
> list is being generated.  We encountered this on a 3+ million
> zone instance.. BIND would stop answering queries for several
> minutes if anyone requested the root stats document or the zone
> list.

Since this name server is approximately a pure recursive
resolver, the list of authoritative zones is short, in fact only
3 configured zones ("localhost", "127.in-addr.arpa" and the
corresponding for IPv6 loopback), and then there's the
"automatic" zones in addition, but still, the halting of query
processing while the list of zones is processed should not be an
issue here.

That said, I'm also rather baffled that BIND would have to stop
processing all queries while traversing the zone instances; that
certainly seems to have an excessive effect on normal operations.

> As Ray says, you may be better off individually querying each
> of the other documents and processing those rather than polling
> the root doc to get them all in one shot.

It's not "me" who is doing the querying, it's the collectd
software.  In the syscall trace, I see indeed that it is asking
for the root document:

GET / HTTP/1.1
Host: localhost:8053
User-Agent: collectd/5.7.2
Accept: */*

However, your advice to query the separate documents in
individual requests would:

 * require a rewrite of the BIND module in collectd
 * still not entirely get rid of the problem that some queries
   are put on hold while the stats channel data is processed and
   sent

Looking at the system call trace shows me that other BIND threads
do process DNS queries while this single thread which does the
HTTP handling does not.  Hence my suggestion to instead use a
dedicated thread for the stats / HTTP handling.

Oh, BTW, it also seems that BIND in my case wastes 15ms doing
needless getsockname() syscalls on FD's which are invalid as part
of the early stages of stats processing:

  5645 17 named1504698577.991440645 CALL  
getsockname(0x,0x7f7fef1f06e0,0x7f7fef1f069c)
  5645 17 named1504698577.991446511 RET   getsockname -1 errno 9 Bad 
file descriptor

(repeated lots of times).

Regards,

- Håvard
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Strange recursor response time pattern

2017-09-06 Thread Havard Eidnes
>> The stats channel output relating to running tasks and memory contexts
>> is very extensive.
>
> Either way I would not have expected use of the statistics
> channel to negatively impact the query performance.  Is the query
> channel processed with "no-delay", so that a thread doesn't get
> stuck waiting for data to drain from the other end?

I've done some further observations, by doing a system call trace
on named when it processes a HTTP request on the stats channel.

In the sample I've looked at, BIND uses ~258ms between receiving
the HTTP request until it's finished writing the statistics data
over HTTP.  Most of the time (236ms) is consumed between
receiving the request and the start of pushing the result over
HTTP.

The thread which writes the statistics data does not process any
DNS client requests between receiving the HTTP request and
finishing sending the HTTP result.  It looks like the thread has
filled the TCP send window around 12 times while sending the HTTP
result, and the OS puts the thread in the "lwp_park" state around
1-2ms each time this happens.

Shortly after the stats data has been sent, the very same thread
proceeds to process DNS client requests.  I'm guessing that DNS
client queries has continued to be dispatched to the thread while
it's processing the HTTP request.

Perhaps the HTTP / stats processing ought to have been done in a
separate thread which isn't also involved in DNS client request
handling?  (No, I do not really know how easy or difficult that
is to make happen...)

Best regards,

- Håvard
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Strange recursor response time pattern

2017-09-05 Thread Havard Eidnes
>> some further local discussion has made me aware that us running
>> "collectd" for monitoring BIND may be contributing to the
>> problem; collectd fetches data each 10s by using the BIND-
>> configured statistics-channel, thus BIND is processing a TCP
>> connection to deliver the statistics data.
>>
>> It's still somewhat surprising and disappointing that this should
>> interfere this much with DNS query processing...
>
> There are various URLs (see the BIND 9 ARM) that provide a subset of the
> full statistics.
>
> The stats channel output relating to running tasks and memory contexts
> is very extensive.

Either way I would not have expected use of the statistics
channel to negatively impact the query performance.  Is the query
channel processed with "no-delay", so that a thread doesn't get
stuck waiting for data to drain from the other end?

> If collectd doesn't need the full set, you may be able to ask for just
> the traffic-volume related subset(s).

I'm using the "bind" plugin in collectd in its default
configuration.

The code for the BIND plugin of collectd is at

  https://github.com/collectd/collectd/blob/master/src/bind.c

...and it's not me who's written it.  We use grafana as the
display frontend, and the choice of what data to graph is made
through that interface.  Therefore it may not be as easy to
restrict what data is fetched via the stats channel.

I'm running BIND on a 16-core AMD Opteron 6274 processor, and
BIND is running a thread on each core.  The machine has 16G
memory, and is not starved in that department either.  I'm
therefore still rather surprised about the impact of using the
stats channel.

The other machines in my test are similarly or better spec'ed,
Xeon E5-2640 (16 cores), 32G memory, and they are also not
resource-starved.

Regards,

- Håvard
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Strange recursor response time pattern

2017-09-05 Thread Matthew Pounsett
On 5 September 2017 at 11:56, Havard Eidnes  wrote:

> Hmm...
>
> some further local discussion has made me aware that us running
> "collectd" for monitoring BIND may be contributing to the
> problem; collectd fetches data each 10s by using the BIND-
> configured statistics-channel, thus BIND is processing a TCP
> connection to deliver the statistics data.
>
> It's still somewhat surprising and disappointing that this should
> interfere this much with DNS query processing...
>
> Is that pulling the old-style stats file, or the HTTP-based stats channel?


If the latter... the zone list (and by extension the root document) seems
to take a long time to process, and involves some sort of locking that
blocks all query processing while the list is being generated.  We
encountered this on a 3+ million zone instance.. BIND would stop answering
queries for several minutes if anyone requested the root stats document or
the zone list.

As Ray says, you may be better off individually querying each of the other
documents and processing those rather than polling the root doc to get them
all in one shot.
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users

Re: Strange recursor response time pattern

2017-09-05 Thread Ray Bellis
On 05/09/2017 16:56, Havard Eidnes wrote:
> Hmm...
> 
> some further local discussion has made me aware that us running
> "collectd" for monitoring BIND may be contributing to the
> problem; collectd fetches data each 10s by using the BIND-
> configured statistics-channel, thus BIND is processing a TCP
> connection to deliver the statistics data.
> 
> It's still somewhat surprising and disappointing that this should
> interfere this much with DNS query processing...

There are various URLs (see the BIND 9 ARM) that provide a subset of the
full statistics.

The stats channel output relating to running tasks and memory contexts
is very extensive.

If collectd doesn't need the full set, you may be able to ask for just
the traffic-volume related subset(s).

Ray

___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Strange recursor response time pattern

2017-09-05 Thread Havard Eidnes
Hmm...

some further local discussion has made me aware that us running
"collectd" for monitoring BIND may be contributing to the
problem; collectd fetches data each 10s by using the BIND-
configured statistics-channel, thus BIND is processing a TCP
connection to deliver the statistics data.

It's still somewhat surprising and disappointing that this should
interfere this much with DNS query processing...

Best regards,

- Håvard
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Strange recursor response time pattern

2017-09-05 Thread Matus UHLAR - fantomas

On 05.09.17 16:44, Havard Eidnes wrote:

It seems that every 10s, "on the clock", BIND will temporarily
increase the query response time rather drastically for a short while,
only to settle down to normal behaviour until the next 10s event.



Any idea what might be causing this?  Anything I can do to remedy
the behaviour?


BIND keeps track of servers it's sending queries to and their RTTs to prefer
sending queries to servers that respond more quickly.

ocasionally, it re-sends queries even to slower or unreachable servers to
see if the situation changed.

what you see may be result of this algorithm.
--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
I just got lost in thought. It was unfamiliar territory. 
___

Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Strange recursor response time pattern

2017-09-05 Thread Havard Eidnes
Hi,

one of my users made me aware of this rather unexpected behaviour
in our recursors running BIND 9.10.5-P3 on NetBSD/amd64:

It seems that every 10s, "on the clock", BIND will temporarily
increase the query response time rather drastically for a short while,
only to settle down to normal behaviour until the next 10s event.

A test I did to replicate the result was to run a shell loop with
a small sub-second sleep (yes, nonstandard):

while sleep .1; do
   date +%s
   dig @$recursor $query A +noadd +noauth
done > /tmp/out

Do this with a query for a name which exists and is in-cache.
Let it run for a while.  Then do:

awk '/Query time:/ { qt=$4 } /^1/ { print $1 " " qt }' /tmp/out > /tmp/qt

Plot the result and see a rather wide variation in response time.

Next do

awk '{ print $1 % 10 " " $2 }' /tmp/qt > /tmp/bins

and put the result through a small perl script (attached) to give
some basic statistics, and in my case I see:

Ixmean   stddev   median min max
0 0.083303 0.555366 0.00 0 16
1 0.071531 0.353116 0.00 0 11
2 0.130875 3.170423 0.00 0 166
3 0.075922 0.467403 0.00 0 18
4 0.167146 3.928216 0.00 0 167
5 0.069928 0.357949 0.00 0 11
6 0.148990 3.536247 0.00 0 183
7 0.090478 1.786360 0.00 0 92
8 47.902799 109.998464 0.00 0 479
9 0.080216 0.373228 0.00 0 12

This I find rather surprising!

This measurement is based on some 27k samples, measured against a BIND
which processes 6-800 qps, so not overly busy, but similar behaviour
is also observed with BINDs which process only 50-150 qps, so the
query load does not appear to affect this significantly -- the result
for the two other BIND installations I've looked at were (based on 24k
and 20k samples):

Ixmean   stddev   median min max
0 15.031710 0.736819 15.00 14 46
1 15.041801 1.359741 15.00 14 77
2 15.003519 0.248714 15.00 14 20
3 15.008326 0.258005 15.00 14 19
4 15.018990 0.447535 15.00 14 30
5 15.007415 0.289008 15.00 14 20
6 31.504860 42.598276 15.00 14 307
7 15.128158 3.613397 15.00 14 133
8 15.055274 2.329867 15.00 14 123
9 15.011628 0.305681 15.00 14 20

and

Ixmean   stddev   median min max
0 9.015588 2.394387 9.00 8 121
1 8.994696 2.334736 9.00 8 124
2 8.966915 0.985907 9.00 8 55
3 8.998417 2.259316 9.00 8 122
4 24.782942 42.980507 9.00 8 265
5 8.961137 0.363112 9.00 8 17
6 8.947259 0.225413 9.00 8 10
7 9.048827 3.174344 9.00 8 128
8 9.004134 2.429312 9.00 8 128
9 9.010314 2.294347 9.00 8 121

Any idea what might be causing this?  Anything I can do to remedy
the behaviour?

Best regards,

- Håvard
#! /usr/bin/perl

use strict;
use Statistics::Descriptive;

our(@d);

while(<>) {
chomp;
@_ = split;
my($i) = $_[0];
my($v) = $_[1];
if (!defined($d[$i])) {
$d[$i] = ();
}
push(@{ $d[$i] }, $v);
}

printf("Ixmean   stddev   median min max\n");
foreach my $i (0 .. $#d) {
my $stat = Statistics::Descriptive::Full->new();
$stat->add_data(@{ $d[$i] });

my $mean = $stat->mean;
my $stddev  = $stat->standard_deviation;
my $median  = $stat->median;
my $min = $stat->min;
my $max = $stat->max;

printf("%d %f %f %f %d %d\n", $i, $mean, $stddev, $median, $min, $max);
}
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users