Re: Bug - remote DNS monitoring
> On Aug 30, 2022, at 1:12 PM, Casey Deccio wrote:
>
> I am having trouble tracking down a bug in my monitoring setup. It all
> happened when I upgraded the monitored host (host B in my example below) to
> bullseye. Note that Host A is also running bullseye, but the problem didn't
> show itself until Host B was upgraded.

With some help over at the bind-users mailing list [1], I discovered that nrpe-ng closes stdin when launching the command [2], and the new version of nslookup (invoked by check_dns) has issues when stdin is closed [3]. Redirecting stdin to /dev/null fixes the issue:

$ diff -u /usr/lib/python3/dist-packages/nrpe_ng/commands.py{.old,}
--- /usr/lib/python3/dist-packages/nrpe_ng/commands.py.old 2017-08-08 13:05:02.0 -0600
+++ /usr/lib/python3/dist-packages/nrpe_ng/commands.py 2022-09-13 17:00:36.767239885 -0600
@@ -85,6 +85,7 @@
     proc = tornado.process.Subprocess(
         run_args,
+        stdin=subprocess.DEVNULL,
         stdout=tornado.process.Subprocess.STREAM,
         close_fds=True,
         env=env)

I've filed a bug report [4].

Thanks,
Casey

[1] https://lists.isc.org/pipermail/bind-users/2022-September/10.html
[2] https://github.com/bootc/nrpe-ng/blob/master/nrpe_ng/commands.py#L86
[3] https://github.com/libuv/libuv/blob/v1.x/src/unix/core.c#L602
[4] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1019718
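For readers who want to see the difference between a closed stdin and a stdin redirected to /dev/null, here is a minimal sketch using only the Python standard library. It does not involve nrpe-ng, tornado, or nslookup. The helper names (`stdin_state`, `_close_stdin`) are hypothetical, and closing file descriptor 0 in a `preexec_fn` is only an assumed stand-in for however the real parent process ends up without a stdin.

```python
import os
import subprocess
import sys

# Child program: report whether file descriptor 0 (stdin) is usable.
CHILD_SOURCE = """
import os
try:
    os.fstat(0)
    print("open")
except OSError:
    print("closed")
"""


def _close_stdin():
    # Runs in the child between fork() and exec(). Hypothetical stand-in for
    # a parent that leaves its child without a stdin.
    try:
        os.close(0)
    except OSError:
        pass  # fd 0 was already closed in the parent


def stdin_state(**popen_kwargs):
    """Run the child with the given Popen options and return what it printed."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        stdout=subprocess.PIPE,
        text=True,
        check=True,
        **popen_kwargs,
    )
    return result.stdout.strip()


# Case 1: the child starts with fd 0 closed.
without_stdin = stdin_state(preexec_fn=_close_stdin)
# Case 2: the child starts with fd 0 connected to /dev/null.
with_devnull = stdin_state(stdin=subprocess.DEVNULL)

print("fd 0 closed before exec: ", without_stdin)
print("stdin=subprocess.DEVNULL:", with_devnull)
```

With `stdin=subprocess.DEVNULL` the child always gets a valid descriptor 0 that simply reads end-of-file, which is why a program that insists on having a usable stdin can start normally.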
Re: Bug - remote DNS monitoring
> On Aug 30, 2022, at 1:40 PM, Nicholas Geovanis wrote:
>
> When you run check_dns by hand on Host B, you don't say who you are logged-in
> as. That can make a difference. Nagios runs its scripts in a known
> environment which may be different than you expect.

Thanks for the question. I have run the check_dns script with:

- an arbitrary, unprivileged user
- the nagios user (also unprivileged)
- root (gasp!)

They all work just fine. Also, in every case, I run tcpdump, and I can see the DNS queries and responses going back and forth just fine. In the strace messages, I can also see that the DNS messages were written and read properly.

I think the issue is in nslookup, some time *after* the send/recv. But I can't narrow it down much more than that.

Casey
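One way to check whether the two environments really differ is to run the same small script once by hand and once through the monitoring agent, then compare the output. The sketch below is hypothetical: the function names are made up for illustration, it is not part of nrpe-ng or monitoring-plugins, and how it would be registered as a command on Host B depends on the local configuration, which is not shown here.

```python
import os
import stat


def describe_fd(fd):
    """Return a short description of what a file descriptor points at."""
    try:
        mode = os.fstat(fd).st_mode
    except OSError:
        return "closed"
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISCHR(mode):
        return "character device"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


def snapshot():
    """Collect the parts of the runtime environment most likely to differ."""
    return {
        "uid": os.getuid(),
        "euid": os.geteuid(),
        "env_names": sorted(os.environ),
        # stdin, stdout, stderr
        "fds": {fd: describe_fd(fd) for fd in (0, 1, 2)},
    }


if __name__ == "__main__":
    for key, value in snapshot().items():
        print(f"{key}: {value}")
```

Comparing the `fds` line from an interactive run with the one produced under the agent shows differences that switching users alone does not reveal, such as a standard stream that is a pipe or is closed.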
Re: Bug - remote DNS monitoring
On Tue, Aug 30, 2022, 2:13 PM Casey Deccio wrote:

> Hi all,
>
> I am having trouble tracking down a bug in my monitoring setup. It all
> happened when I upgraded the monitored host (host B in my example below) to
> bullseye. Note that Host A is also running bullseye, but the problem
> didn't show itself until Host B was upgraded.
>
> Here is the setup:
>
> Host A (monitoring):
> Installed: nagios4, nrpe-ng
> IP address: 192.0.2.1
>
> Host B (monitored):
> Installed: nrpe-ng, monitoring-plugins-standard, bind9-dnsutils
> IP address: 192.0.2.2
>
> Host C (monitored through host B):
> Installed: bind9
> IP address: 192.0.2.3
> Configured to answer authoritatively for example.com on port 53.
>
>           nrpe
>        over HTTPs     DNS
> Host A --> Host B -> Host C
>

When you run check_dns by hand on Host B, you don't say who you are logged-in as. That can make a difference. Nagios runs its scripts in a known environment which may be different than you expect.

> On Host B, I run the following:
> sudo /usr/bin/python3 /usr/sbin/nrpe-ng --debug -f --config /etc/nagios/nrpe-ng.cfg
>
> While that is running, I run the following on Host A:
> /usr/lib/nagios/plugins/check_nrpe_ng -H 192.0.2.2 -c check_dns -a example.com 192.0.2.3 0.1 1.0
>
> The result of running the command on Host A is:
> DNS CRITICAL - '/usr/bin/nslookup -sil' msg parsing exited with no address
>
> On Host B, I see the following debug output:
> 200 POST /v1/check/check_dns (192.0.2.1) 78.05ms
> Executing: /usr/lib/nagios/plugins/check_dns -H example.com -s 192.0.2.3 -A -w 0.1 -c 1.0
>
> When I run this exact command on Host B, I get:
> $ /usr/lib/nagios/plugins/check_dns -H example.com -s 192.0.2.3 -A -w 0.1 -c 1.0
> DNS OK: 0.070 seconds response time. example.com returns 192.0.2.10,2001:db8::10|time=0.069825s;0.10;1.00;0.00
>
> Looks good! When I run nslookup (run by check_dns), it looks good too:
> $ /usr/bin/nslookup -sil example.com 192.0.2.3
> Server:  192.0.2.3
> Address: 192.0.2.3#53
>
> Name:    example.com
> Address: 192.0.2.10
> Name:    example.com
> Address: 2001:db8::10
>
> After rerunning nrpe-ng with strace -f, I see something:
>
> [pid 1183842] write(2, "nslookup: ./src/unix/core.c:570:"..., 83) = 83
> ...
> [pid 1183841] read(4, "nslookup: ./src/unix/core.c:570:"..., 4096) = 83
>
> So it appears that the nslookup process is reporting an error. But I
> cannot reproduce it outside of nrpe-ng.
>
> Any suggestions?
>
> Casey
Bug - remote DNS monitoring
Hi all,

I am having trouble tracking down a bug in my monitoring setup. It all happened when I upgraded the monitored host (host B in my example below) to bullseye. Note that Host A is also running bullseye, but the problem didn't show itself until Host B was upgraded.

Here is the setup:

Host A (monitoring):
Installed: nagios4, nrpe-ng
IP address: 192.0.2.1

Host B (monitored):
Installed: nrpe-ng, monitoring-plugins-standard, bind9-dnsutils
IP address: 192.0.2.2

Host C (monitored through host B):
Installed: bind9
IP address: 192.0.2.3
Configured to answer authoritatively for example.com on port 53.

          nrpe
       over HTTPs     DNS
Host A --> Host B -> Host C

On Host B, I run the following:
sudo /usr/bin/python3 /usr/sbin/nrpe-ng --debug -f --config /etc/nagios/nrpe-ng.cfg

While that is running, I run the following on Host A:
/usr/lib/nagios/plugins/check_nrpe_ng -H 192.0.2.2 -c check_dns -a example.com 192.0.2.3 0.1 1.0

The result of running the command on Host A is:
DNS CRITICAL - '/usr/bin/nslookup -sil' msg parsing exited with no address

On Host B, I see the following debug output:
200 POST /v1/check/check_dns (192.0.2.1) 78.05ms
Executing: /usr/lib/nagios/plugins/check_dns -H example.com -s 192.0.2.3 -A -w 0.1 -c 1.0

When I run this exact command on Host B, I get:
$ /usr/lib/nagios/plugins/check_dns -H example.com -s 192.0.2.3 -A -w 0.1 -c 1.0
DNS OK: 0.070 seconds response time. example.com returns 192.0.2.10,2001:db8::10|time=0.069825s;0.10;1.00;0.00

Looks good! When I run nslookup (run by check_dns), it looks good too:
$ /usr/bin/nslookup -sil example.com 192.0.2.3
Server:  192.0.2.3
Address: 192.0.2.3#53

Name:    example.com
Address: 192.0.2.10
Name:    example.com
Address: 2001:db8::10

After rerunning nrpe-ng with strace -f, I see something:

[pid 1183842] write(2, "nslookup: ./src/unix/core.c:570:"..., 83) = 83
...
[pid 1183841] read(4, "nslookup: ./src/unix/core.c:570:"..., 4096) = 83

So it appears that the nslookup process is reporting an error. But I cannot reproduce it outside of nrpe-ng.

Any suggestions?

Casey