I just randomly spotted this post, and thought I would toss in 2¢. How many NICs and how many IPs are on the servers? Are the failing clients on the same subnet as the server?
-- Gordon A. Lang

On Thu, May 30, 2019, 8:10 PM Gregory Sloop <gr...@sloop.net> wrote:

> So, this is a very odd situation and I'm kind of grasping at straws here.
> So, I've come to see if any of you have any good straws!
>
> The setup.
> ---
> Ubuntu 18.04 LTS is the distro we're running on.
> All software is packaged [from the distro] - not compiled from sources.
> Bind9 acting as a recursive resolver for a smallish network. 150 seats.
> They're also handling DHCP and Chrony/NTP requests.
> [I actually have a pair of these handling DNS/DHCP/NTP; this is the master.]
>
> They are running on a Xen/XCP VM.
>
> The one I'm having problems with is the master for several internal zones -
> the one that's working fine is the slave for those same zones. None of the
> zones are large.
>
> Intermittently, Bind9 simply stops handling queries from *some* hosts.
> Meaning, it simply times out for responses for those hosts.
> Yet BIND *is* working fine for lots of other machines on the same
> networks. It's working fine doing dig queries locally on the server, and
> handles DNS queries fine for lots of other machines. Yet, again, some
> machines simply get time-outs. I can't find any pattern to which machines
> get timeouts and which don't.
>
> I've checked - no firewalls, fail2ban or the like that might be causing
> this.
> No SELinux/AppArmor.
> Hosts that can't do DNS queries can ping the DNS server fine.
> [So, there's at least some network pathway to the DNS machine.]
>
> Review of the logs for BIND doesn't show anything that looks like a problem
> to me.
> [But I'm not sure what keywords I ought to be looking for, in an effort to
> find symptoms/problems.]
>
> Finally, the two bind/dhcp/ntp servers are currently running on the same
> Xen host, so if it's somehow host related, I'd expect both to have
> problems, but they don't.
>
> Top doesn't show any CPU distress.
> Processes look fine.
> Memory in use is far below what's allocated to the machine. [1G allocated,
> like <400M used.]
> Restart of BIND doesn't do anything, at least in the cases I've seen -
> which aren't all that many yet.
> A restart of the whole VM does appear to fix the issue immediately.
> These appear to occur every 3-5 days.
> Oh, and if you simply wait, it eventually starts handling queries for all
> hosts again - but it might be a couple+ hours.
>
> Any suggestions on things I might hunt for in the logs in an attempt to
> figure out what's happening?
> Other suggestions for things to look for/consider?
> <gr...@sloop.net>
> TIA
> -Greg
> _______________________________________________
> Please visit https://lists.isc.org/mailman/listinfo/bind-users to
> unsubscribe from this list
>
> bind-users mailing list
> bind-users@lists.isc.org
> https://lists.isc.org/mailman/listinfo/bind-users
>