Martin,

It looks like you were relying on an odd mechanism to detect an outage. What you were seeing was the server filling up all of its available recursive "slots": queries weren't getting answered, so the queue backed up. That isn't necessarily an indication of an outage; it could also have meant that too many people were trying to do lookups at once. Still, I suspect it worked well for you, and it would generally indicate that there was a problem.

I'd suggest using stats to look for problems instead. We've been testing running "rndc stats" every couple of minutes on a server, then parsing that data both to dump it into a DB for graphing and to raise alerts. With some pretty simple programming, you can keep a rolling average of errors. Then, if you get a value that's more than X above that average, you could raise an alert, or consider that to be an "outage".

What's harder is finding a really good way to detect "abnormal" numbers of queries, as a single average isn't the best baseline. Weekends are lower, weekdays are higher ... I guess the best way to do it would be to keep a separate average for each day of the week (Monday-Sunday), and if the current error count is greater than that day's norm, it's abnormal. But I digress...

In your situation, looking for hard downs on your connectivity, you would see successful queries drop to 0 (or near 0) and your errors ramp up. That wouldn't be a hard one to detect programmatically.

The other nice thing about putting this all into a DB is that you can look back and get historical stats quite easily. Look at tools like rrd/cacti for graphing; we've been using Perl for the monitoring stuff. It's not quite as simple as looking for log lines, but it's all pretty easy overall, and it has some nice bonuses.

Cheers,
Todd.

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Martin McCormick
Sent: Friday, July 17, 2009 9:20 AM
To: [email protected]
Subject: Bind9.5.1 under no Root Name Servers

What does bind9.5.1 do when there is an Internet issue and we lose all root name servers? The bind9.3.x we had been running always began producing tons of lines saying that there were no more recursive clients. I had written a program that looked for the time stamp when the mess started and then for the time stamp of the last distress call, and we called that an outage, since bind certainly wasn't happy.

We had a very brief outage on the day we switched to bind9.5.1, and I saw nothing remarkable in the named.log file during the period where we lost all roots. Either bind9.5.1 doesn't produce this message, or the hit just didn't last long enough for all the recursive slots to fill up. We do allow recursion from within our network but disallow it for third parties.

Bind is an excellent place to take the pulse of one's whole network, since it is so closely tied to everything else.

Here is an actual example of the message we look for:

08-Jul-2009 08:38:20.296 client 139.78.102.224#53631: no more recursive clients: quota reached

Martin McCormick WB5AGZ  Stillwater, OK
Systems Engineer
OSU Information Technology Department Telecommunications Services Group
_______________________________________________
bind-users mailing list
[email protected]
https://lists.isc.org/mailman/listinfo/bind-users
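
For anyone wanting to try the rolling-average alerting suggested in the reply, here is a minimal sketch. It is written in Python rather than the Perl mentioned in the thread, and it is an illustration, not the code anyone on the list actually runs. The class name, the window size, the threshold (the "X" in the reply) and the "near zero" cutoff are all made-up values for the example. It also assumes you have already pulled per-interval success and error counts out of the "rndc stats" output; that parsing is deliberately left out.

```python
from collections import deque


class RollingErrorMonitor:
    """Hypothetical helper: rolling average of error counts with simple alerts."""

    def __init__(self, window=30, threshold=50, near_zero=5):
        # window:    how many recent samples feed the rolling average (assumed value)
        # threshold: the "X" from the reply; errors this far above average alert
        # near_zero: successes at or below this count as "0 (or near 0)"
        self.samples = deque(maxlen=window)
        self.threshold = threshold
        self.near_zero = near_zero

    def check(self, successes, errors):
        """Classify one polling interval as "ok", "alert" or "outage"."""
        # Average the history *before* adding the new sample, so the new
        # value is compared against what came before it.
        average = sum(self.samples) / len(self.samples) if self.samples else None
        self.samples.append(errors)

        if average is None:
            return "ok"  # no history yet, nothing to compare against
        if errors > average + self.threshold:
            # Errors have ramped up; if successes also collapsed, treat it
            # as a hard down rather than just a noisy interval.
            if successes <= self.near_zero:
                return "outage"
            return "alert"
        return "ok"
```

Calling check(successes, errors) once per polling interval returns "ok" while errors stay near the recent average, "alert" when they jump by more than the threshold, and "outage" when that jump coincides with successful queries dropping to near zero. The per-weekday baseline mentioned in the reply could be approximated by keeping one such monitor per day of the week.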

