Re: Sporadic Timeouts after upgrading to bind9.20

2024-09-05 Thread Havard Eidnes via bind-users
> On our production name servers we check every 30s if bind
> is alive by sending a SOA query to bind. Today I upgraded a few
> nodes from 9.18.x (x between 17 and 27) to 9.20.1 (Ubuntu 24.04
> with packages from ISC ppa).
>
> Since that, we have sporadic timeouts (3s). On the nodes with
> more qps we see it more often.
>
> Before I dig into the problem, are there any specific changes
> to 9.20 that I should look at? Maybe some default value changes
> for socket buffers, thread handling ...?

I can't answer specifically about BIND 9.20; I'm currently
dipping my toes carefully into the waters of "deploying BIND 9.20
as a recursor".

What you don't say anything about is whether you see increased
CPU load on your hosts, and whether the relationship between QPS
and CPU load has changed after upgrading to 9.20.  Also, what
general level of load do you observe on this / these host(s)?
E.g. "how close to the limit of what it can do" are you?


In our deployment, we monitor the relationship between the number
of "udp: dropped due to full socket buffers" and "udp: datagrams
received" (in our case via collectd / graphite / grafana), and
when we started doing that we found out that we needed to bump
the default UDP socket buffers quite a bit to get that event rate
to go down to acceptable rates.  Regrettably, as far as I know,
BIND does not have a knob to adjust the socket buffer size for
the UDP sockets BIND itself uses, so what I ended up doing was
bumping the default for UDP sockets for the entire host via sysctl.
In my case that's "fine" because the host is basically only
serving this single function.
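
To see what a given host currently hands out before and after such a
sysctl change, a minimal sketch, assuming Python is available on the
host: it creates a throwaway UDP socket and reads back SO_RCVBUF.  The
function name is made up for illustration, and platforms differ in how
they account for the reported number, so treat it as a relative
indication only.

```python
import socket


def default_udp_rcvbuf() -> int:
    """Return the receive buffer size the kernel gives a fresh UDP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # No setsockopt() has been done, so this is the system default.
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    print(f"default UDP receive buffer: {default_udp_rcvbuf()} bytes")
```

This only shows the default for newly created sockets; it says nothing
about the sockets of an already running named.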

Then again, I'm the weirdo running BIND on NetBSD, so the
defaults are probably widely different in your case.

Just an example from one of our publishing (non-recursive) BIND
servers, from "netstat -s" output:

udp:
1669688117 datagrams received
0 with incomplete header
10 with bad data length field
994 with bad checksum
10922 dropped due to no socket
874709 broadcast/multicast datagrams dropped due to no socket
890955 dropped due to full socket buffers
1667910527 delivered
2741883224 PCB hash misses
1632037948 datagrams output

which comes out to 0.05% as an overall average of "drops due to full
socket buffers", but that doesn't mean there aren't occasional
(smallish) spikes in the rate, of course.  And this is with BIND
9.18.29.
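
The arithmetic behind that percentage can be sketched as follows.  This
is a minimal illustration in Python, assuming the "netstat -s" layout
shown above (a count followed by a description on each line); the
function names are invented for the example, and other operating
systems word these counters differently.

```python
import re

# The udp section quoted above, verbatim.
NETSTAT_UDP = """\
udp:
1669688117 datagrams received
0 with incomplete header
10 with bad data length field
994 with bad checksum
10922 dropped due to no socket
874709 broadcast/multicast datagrams dropped due to no socket
890955 dropped due to full socket buffers
1667910527 delivered
2741883224 PCB hash misses
1632037948 datagrams output
"""


def udp_counters(text: str) -> dict:
    """Map each counter description to its integer value."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+(.*\S)", line)
        if m:
            counters[m.group(2)] = int(m.group(1))
    return counters


def drop_percentage(counters: dict) -> float:
    """Full-socket-buffer drops as a percentage of datagrams received."""
    received = counters["datagrams received"]
    dropped = counters["dropped due to full socket buffers"]
    return 100.0 * dropped / received


if __name__ == "__main__":
    print(f"{drop_percentage(udp_counters(NETSTAT_UDP)):.3f}%")
```

Since these are counters since boot, sampling them periodically and
taking the difference between samples is what reveals the spikes.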

In other words: I think more information is needed to help you
diagnose the issue.
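
One cheap piece of extra information is the latency of each probe
rather than only pass/fail.  Below is a minimal sketch of an SOA
liveness check of the kind described in the question, using only the
Python standard library.  The function names, the use of port 53, and
the choice to send the query without the "recursion desired" flag are
assumptions for illustration, not taken from the original monitoring
setup.

```python
import random
import socket
import struct
import time


def build_soa_query(zone: str, query_id: int) -> bytes:
    """Build a DNS query packet asking for the SOA record of `zone`."""
    # Header: ID, flags (all zero: standard query, no RD), QDCOUNT=1,
    # ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in zone.rstrip(".").split(".")
        if label
    ) + b"\x00"
    # QTYPE=6 (SOA), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 6, 1)


def probe(server: str, zone: str, timeout: float = 3.0):
    """Return the response time in seconds, or None on timeout/mismatch."""
    qid = random.randint(0, 0xFFFF)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        start = time.monotonic()
        s.sendto(build_soa_query(zone, qid), (server, 53))
        try:
            data, _ = s.recvfrom(4096)
        except socket.timeout:
            return None
        # Only accept a reply that echoes our query ID.
        if len(data) < 2 or struct.unpack("!H", data[:2])[0] != qid:
            return None
        return time.monotonic() - start
```

Logging the value returned by probe() alongside CPU load and the UDP
drop counters would show whether the timeouts line up with either.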


Regards,

- Håvard
-- 
Visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from 
this list

ISC funds the development of this software with paid support subscriptions. 
Contact us at https://www.isc.org/contact/ for more information.


bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: BIND statistics

2024-08-26 Thread Havard Eidnes via bind-users
> On Mon, Aug 26, 2024 at 06:05:19PM +0200, Havard Eidnes via bind-users wrote:
>> Thanks.  I found it, and it's more than a little embarrassing.
>> 
>> This is what you get when not building with --with-libxml2: an
>> "un-rendered" xsl file as a result, in essence just the content
>> of bin/named/xsl.c.  And this happened because I wasn't paying
>> attention to what options were turned on by default for the
>> package I was putting together.  "Surely stats is on by default!"
>> Not so.  (Well, I didn't even think it was optional.)  Lesson
>> learned.
>
> It *is* on by default, if it can find libxml2. Does yours live in
> a nonstandard location?

Time for more confessions.

This is in NetBSD's pkgsrc, which only builds with explicitly
"buildlinked" libraries, so that build dependencies are
explicitly declared, and not automatically picked up from those
you just accidentally happen to have installed on the build host.
What I had overlooked was that I needed this in /etc/mk.conf:

PKG_OPTIONS.bind+=  bind-xml-statistics-server

It's another matter whether this one should default to "on" in
the package itself -- I'm leaning in that direction, but need to
discuss with some others before I change the default.  And I also
need the "dnstap" option in my deployment, so I need a custom
build anyway.  Like I said, "lesson learned".

> Perhaps, if libxml2 and libjson-c are both unavailable, we should
> disable statistics-channels in the configuration - at least that way
> the problem would've been easier to figure out.

Right, I was sort of thinking in that direction as well, but
would not be too insistent on something along those lines.
Perhaps return a web page saying "built without both libjson-c
and libxml2, so nothing to see here"?

Regards,

- Håvard


Re: BIND statistics

2024-08-26 Thread Havard Eidnes via bind-users
> If I was debugging this I would:
> - compare strace output from working and non-working server

I did parts of that, ref. that other message I sent.

> Unfortunately you are the only person who reported this problem and I
> can't reproduce it either, so it's probably up to you to find needle
> in the haystack. Good luck!

Thanks.  I found it, and it's more than a little embarrassing.

This is what you get when not building with --with-libxml2: an
"un-rendered" xsl file as a result, in essence just the content
of bin/named/xsl.c.  And this happened because I wasn't paying
attention to what options were turned on by default for the
package I was putting together.  "Surely stats is on by default!"
Not so.  (Well, I didn't even think it was optional.)  Lesson
learned.
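
A quick way to spot this condition from outside is to look at the
Content-Type header of the statistics channel response.  In the ktrace
output elsewhere in this thread, the affected server answered with
"text/xslt+xml" while the working one answered with "text/xml".  The
sketch below, in Python, relies on that single observation only; the
function names are invented for the example, and whether other builds
or versions behave the same way is an assumption.

```python
import sys
import urllib.request


def looks_unrendered(content_type: str) -> bool:
    """True if the header matches what the affected server sent."""
    # Ignore any parameters such as "; charset=...".
    return content_type.split(";")[0].strip().lower() == "text/xslt+xml"


def check_stats_channel(url: str, timeout: float = 3.0) -> bool:
    """Fetch `url` and report whether it looks like the bare xsl file."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return looks_unrendered(resp.headers.get("Content-Type", ""))


if __name__ == "__main__":
    # Usage: pass the statistics channel URL as the only argument.
    if len(sys.argv) > 1:
        print(check_stats_channel(sys.argv[1]))
```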

Regards,

- Håvard


Re: BIND statistics

2024-08-26 Thread Havard Eidnes via bind-users
BTW,

I got an off-line question how the chrooting is done in my case,
i.e. whether the "chroot" program is used, or the "-t" option to
BIND is used.

In my case it's the latter:

   -t directory
  This option tells named to chroot to directory after processing
  the command-line arguments, but before reading the configuration
  file.

Regards,

- Håvard


Re: BIND statistics

2024-08-26 Thread Havard Eidnes via bind-users
Hi,

and thanks for the suggestions.

This is not an issue of broken clocks; all the involved machines
run ntp and have good sync status traceable to at least a GPS clock.

This does however appear to have something to do with the
chroot'edness of this particular installation, and it's evident that
"something is missing" in the chroot, and that this "something" is a
run-time dependency of some sort.

I have one installation of 9.20.0 which doesn't run in a chroot, and
there the stats are rendering properly in my firefox browser (there's
some oddity with the graphics display in Chrome, will bring that up
separately).

Ktracing the start of the response to the statistics reports reveals a
marked difference:

The chroot'ed system's first few lines of output:

 12931  12931 named    GIO   fd 1028 wrote 4088 bytes
   "HTTP/1.1 200 OK\r\nContent-Type: text/xslt+xml\r\nDate: Mon, 26 Aug 20\
24 08:05:10 GMT\r\nExpires: Mon, 26 Aug 2024 08:05:10 GMT\r\nLast-Modi\
fied: Sat, 24 Aug 2024 19:22:20 GMT\r\nCache-Control: public\r\nServer\
: libisc\r\nContent-Length: 39276\r\n\r\n\n\n\nhttp://www.w3.org/1999/xhtml\
\" version=\"1.0\">\n  \n  \n  \n \
 \n<\

while the non-chroot'ed system outputs:

   861861 named    GIO   fd 35 wrote 4088 bytes
   "HTTP/1.1 200 OK\r\nContent-Type: text/xml\r\nDate: Mon, 26 Aug 2024 08\
:15:10 GMT\r\nExpires: Mon, 26 Aug 2024 08:15:10 GMT\r\nLast-Modified:\
 Mon, 26 Aug 2024 08:15:10 GMT\r\nPragma: no-cache\r\nCache-Control: n\
o-cache\r\nServer: libisc\r\nContent-Length: 38449\r\n\r\n\n\n20\
24-08-16T17:12:39.761Z2024-08-26T07:44:26.863\
Z2024-08-26T08:15:10.620Z9.20.04534240<\
/counter>0Traffic Size\n\nServer Status\n\n   \
   \nBoot time:\n  \
  \n  \n   \
 \n  \n  \n  \
  Last reconfigured:\n\n  \n\n   \
   \n  \nCurrent time:<\
/th>\n\n  \n\n  \n  \nServer version:\n\n \
 \n\n  \n\n\n


Re: BIND statistics

2024-08-26 Thread Havard Eidnes via bind-users
>> Hi Håvard.
>> Have you tried a different browser?
>
> Not yet.  Will do tomorrow.

Latest Chrome on MacOS: just the same; it displays the raw XML
which isn't exactly user-friendly.

Regards,

- Håvard


Re: BIND statistics

2024-08-26 Thread Havard Eidnes via bind-users
Looking a bit further, I find in the XML output:

Server Status

  
Boot time:

  

  

So no actual value?  Is there a required post-processing step
which is omitted?  I see xsl is mentioned both here and in the
style definition at the start of the XML output.  I am however
way too unfamiliar with the various XML-related tools to tell
which piece is either missing or mal-functioning.

This particular name server instance is running in a chroot, so
naturally no external xsl processor is available (but surely BIND
doesn't do it that way).

However, I don't find any "stray" references to XSLTPROC in the
code, so in case that transformation is supposed to be done in
some way, it must be done by some other method.  My libxml2 is
version 2.12.8, and is accepted by configure.

Regards,

- Håvard



Re: BIND statistics

2024-08-25 Thread Havard Eidnes via bind-users
> Hi Håvard.
> Have you tried a different browser?

Not yet.  Will do tomorrow.

> Having said that, I just started 9.20.0 with this config:
>
> statistics-channels { inet 127.0.1.0 port 8080 ; };
>
> Then pointed three different browsers at that address/port and it looks
> fine to me in all of them.
> Browsers tried were Chrome, Safari and Firefox.
>
> I can't reproduce your issue, sorry.

OK, thanks for checking anyway, will do more testing.

Regards,

- Håvard


BIND statistics

2024-08-25 Thread Havard Eidnes via bind-users
Hi,

I'm mostly running BIND 9.18.x, and have configured statistics
publishing via

statistics-channels {
inet 127.0.0.1 port 8053 allow { 127.0.0.1; };
inet "actual-address" port 8053 allow { prefix1/24; prefix2/24; };
};

I've started testing 9.20.x.

I see BIND 9.20.x stats publishing is ... different.

If I use firefox and visit http://actual-address:8053/ with BIND
9.18.x, I get a reasonably rendered HTML display which is easy to
view.

Not so for BIND 9.20.x; I get an XML document which firefox (in
this particular case version 120.0) informs me at the top

   This XML file does not appear to have any style information
   associated with it. The document tree is shown below.

and the document starts with









https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"/>