Strange recursor response time pattern

2017-09-05 Thread Havard Eidnes
Hi,

one of my users made me aware of this rather unexpected behaviour
in our recursors running BIND 9.10.5-P3 on NetBSD/amd64:

It seems that every 10s, "on the clock", BIND will temporarily
increase the query response time rather drastically for a short while,
only to settle down to normal behaviour until the next 10s event.

A test I did to replicate the result was to run a shell loop with
a small sub-second sleep (yes, nonstandard):

while sleep .1; do
   date +%s
   dig @$recursor $query A +noadd +noauth
done > /tmp/out

Do this with a query for a name which exists and is in-cache.
Let it run for a while.  Then do:

awk '/Query time:/ { qt=$4 } /^1/ { print $1 " " qt }' /tmp/out > /tmp/qt

Plot the result and see a rather wide variation in response time.

Next do

awk '{ print $1 % 10 " " $2 }' /tmp/qt > /tmp/bins

and put the result through a small perl script (attached) to give
some basic statistics, and in my case I see:

Ixmean   stddev   median min max
0 0.083303 0.555366 0.00 0 16
1 0.071531 0.353116 0.00 0 11
2 0.130875 3.170423 0.00 0 166
3 0.075922 0.467403 0.00 0 18
4 0.167146 3.928216 0.00 0 167
5 0.069928 0.357949 0.00 0 11
6 0.148990 3.536247 0.00 0 183
7 0.090478 1.786360 0.00 0 92
8 47.902799 109.998464 0.00 0 479
9 0.080216 0.373228 0.00 0 12

This I find rather surprising!

This measurement is based on some 27k samples, measured against a BIND
which processes 6-800 qps, so not overly busy, but similar behaviour
is also observed with BINDs which process only 50-150 qps, so the
query load does not appear to affect this significantly -- the result
for the two other BIND installations I've looked at were (based on 24k
and 20k samples):

Ixmean   stddev   median min max
0 15.031710 0.736819 15.00 14 46
1 15.041801 1.359741 15.00 14 77
2 15.003519 0.248714 15.00 14 20
3 15.008326 0.258005 15.00 14 19
4 15.018990 0.447535 15.00 14 30
5 15.007415 0.289008 15.00 14 20
6 31.504860 42.598276 15.00 14 307
7 15.128158 3.613397 15.00 14 133
8 15.055274 2.329867 15.00 14 123
9 15.011628 0.305681 15.00 14 20

and

Ixmean   stddev   median min max
0 9.015588 2.394387 9.00 8 121
1 8.994696 2.334736 9.00 8 124
2 8.966915 0.985907 9.00 8 55
3 8.998417 2.259316 9.00 8 122
4 24.782942 42.980507 9.00 8 265
5 8.961137 0.363112 9.00 8 17
6 8.947259 0.225413 9.00 8 10
7 9.048827 3.174344 9.00 8 128
8 9.004134 2.429312 9.00 8 128
9 9.010314 2.294347 9.00 8 121

Any idea what might be causing this?  Anything I can do to remedy
the behaviour?

Best regards,

- Håvard
#! /usr/bin/perl

use strict;
use Statistics::Descriptive;

our(@d);

while(<>) {
chomp;
@_ = split;
my($i) = $_[0];
my($v) = $_[1];
if (!defined($d[$i])) {
$d[$i] = ();
}
push(@{ $d[$i] }, $v);
}

printf("Ixmean   stddev   median min max\n");
foreach my $i (0 .. $#d) {
my $stat = Statistics::Descriptive::Full->new();
$stat->add_data(@{ $d[$i] });

my $mean = $stat->mean;
my $stddev  = $stat->standard_deviation;
my $median  = $stat->median;
my $min = $stat->min;
my $max = $stat->max;

printf("%d %f %f %f %d %d\n", $i, $mean, $stddev, $median, $min, $max);
}
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users

Re: Strange recursor response time pattern

2017-09-05 Thread Matus UHLAR - fantomas

On 05.09.17 16:44, Havard Eidnes wrote:

It seems that every 10s, "on the clock", BIND will temporarily
increase the query response time rather drastically for a short while,
only to settle down to normal behaviour until the next 10s event.



Any idea what might be causing this?  Anything I can do to remedy
the behaviour?


BIND keeps track of servers it's sending queries to and their RTTs to prefer
sending queries to servers that respond more quickly.

ocasionally, it re-sends queries even to slower or unreachable servers to
see if the situation changed.

what you see may be result of this algorithm.
--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
I just got lost in thought. It was unfamiliar territory. 
___

Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Strange recursor response time pattern

2017-09-05 Thread Havard Eidnes
Hmm...

some further local discussion has made me aware that us running
"collectd" for monitoring BIND may be contributing to the
problem; collectd fetches data each 10s by using the BIND-
configured statistics-channel, thus BIND is processing a TCP
connection to deliver the statistics data.

It's still somewhat surprising and disappointing that this should
interfere this much with DNS query processing...

Best regards,

- Håvard
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Strange recursor response time pattern

2017-09-05 Thread Ray Bellis
On 05/09/2017 16:56, Havard Eidnes wrote:
> Hmm...
> 
> some further local discussion has made me aware that us running
> "collectd" for monitoring BIND may be contributing to the
> problem; collectd fetches data each 10s by using the BIND-
> configured statistics-channel, thus BIND is processing a TCP
> connection to deliver the statistics data.
> 
> It's still somewhat surprising and disappointing that this should
> interfere this much with DNS query processing...

There are various URLs (see the BIND 9 ARM) that provide a subset of the
full statistics.

The stats channel output relating to running tasks and memory contexts
is very extensive.

If collectd doesn't need the full set, you may be able to ask for just
the traffic-volume related subset(s).

Ray

___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Strange recursor response time pattern

2017-09-05 Thread Matthew Pounsett
On 5 September 2017 at 11:56, Havard Eidnes  wrote:

> Hmm...
>
> some further local discussion has made me aware that us running
> "collectd" for monitoring BIND may be contributing to the
> problem; collectd fetches data each 10s by using the BIND-
> configured statistics-channel, thus BIND is processing a TCP
> connection to deliver the statistics data.
>
> It's still somewhat surprising and disappointing that this should
> interfere this much with DNS query processing...
>
> Is that pulling the old-style stats file, or the HTTP-based stats channel?


If the latter... the zone list (and by extension the root document) seems
to take a long time to process, and involves some sort of locking that
blocks all query processing while the list is being generated.  We
encountered this on a 3+ million zone instance.. BIND would stop answering
queries for several minutes if anyone requested the root stats document or
the zone list.

As Ray says, you may be better off individually querying each of the other
documents and processing those rather than polling the root doc to get them
all in one shot.
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users

Re: Strange recursor response time pattern

2017-09-05 Thread Havard Eidnes
>> some further local discussion has made me aware that us running
>> "collectd" for monitoring BIND may be contributing to the
>> problem; collectd fetches data each 10s by using the BIND-
>> configured statistics-channel, thus BIND is processing a TCP
>> connection to deliver the statistics data.
>>
>> It's still somewhat surprising and disappointing that this should
>> interfere this much with DNS query processing...
>
> There are various URLs (see the BIND 9 ARM) that provide a subset of the
> full statistics.
>
> The stats channel output relating to running tasks and memory contexts
> is very extensive.

Either way I would not have expected use of the statistics
channel to negatively impact the query performance.  Is the query
channel processed with "no-delay", so that a thread doesn't get
stuck waiting for data to drain from the other end?

> If collectd doesn't need the full set, you may be able to ask for just
> the traffic-volume related subset(s).

I'm using the "bind" plugin in collectd in its default
configuration.

The code for the BIND plugin of collectd is at

  https://github.com/collectd/collectd/blob/master/src/bind.c

...and it's not me who's written it.  We use grafana as the
display frontend, and the choice of what data to graph is made
through that interface.  Therefore it may not be as easy to
restrict what data is fetched via the stats channel.

I'm running BIND on a 16-core AMD Opteron 6274 processor, and
BIND is running a thread on each core.  The machine has 16G
memory, and is not starved in that department either.  I'm
therefore still rather surprised about the impact of using the
stats channel.

The other machines in my test are similarly or better spec'ed,
Xeon E5-2640 (16 cores), 32G memory, and they are also not
resource-starved.

Regards,

- Håvard
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


BIND and ulimit's on Linux

2017-09-05 Thread Browne, Stuart via bind-users
Hi,

Just a quick question.

I've recently run in to another daemon (not associated with BIND) that 
inherited its 'nofile' ulimit before dropping privileges and was wanting to 
confirm that BIND doesn't work this way.

On some of our servers (zone distribution points) where lots of AXFR's (over 
LAN) come in, we occasionally see them not respond to queries for brief times. 
The servers are reporting a 'file-nr' value close (but not exceeding) root's 
nofile ulimit, and was wondering if this could be related.

Stuart
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users