RE: Bind9.5.1 under no Root Name Servers

2009-07-17 Thread Todd Snyder
Martin,

It looks like you were relying on an odd mechanism to detect an
outage.  What you were seeing was the server filling up all the available
recursive slots because queries weren't getting answered, which backed up
the queue.  That wasn't necessarily an indication of an outage; it could
also have meant that too many people were trying to do lookups at once.
However, I suspect that it worked well for you and would generally
indicate that there was a problem.

I'd suggest using stats to look for problems instead.  We've been
testing running rndc stats every couple of minutes on a server, then
parsing that data both to dump it into a DB for graphing and to
raise alerts.  With some pretty simple programming, you can keep a
rolling average of errors.  Then, if you get a value that's more than X
above that average, you can raise an alert or consider that to be an
outage.  What's harder is finding a really good way to detect
abnormal numbers of queries, as a single overall average isn't the best
baseline: weekends are lower, weekdays are higher.  I guess the best way
to do it would be to keep an average for each day of the week
(Monday-Sunday), and if the current error count is greater than that
day's norm, it's abnormal.  But I digress...
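
A minimal sketch of the two baselines described above, in Python rather than the Perl we actually use.  The class names, the window size and the threshold ("X") are all made up for illustration, and the inputs are assumed to be error counts per polling interval, not raw cumulative counters.

```python
from collections import deque
from statistics import mean


class RollingErrorMonitor:
    """Rolling average of per-interval error counts, with a spike check."""

    def __init__(self, window=30, threshold=50):
        # window and threshold (the "X" above) are illustrative values.
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, errors):
        """Record one sample; True if it is more than X above the average."""
        baseline = mean(self.samples) if self.samples else None
        alert = baseline is not None and errors > baseline + self.threshold
        self.samples.append(errors)
        return alert


class WeekdayBaseline:
    """Separate average for each day of the week (Monday=0 .. Sunday=6)."""

    def __init__(self):
        self.totals = [0.0] * 7
        self.counts = [0] * 7

    def update(self, day, errors):
        """Fold one day's error count into that weekday's average."""
        index = day.weekday()
        self.totals[index] += errors
        self.counts[index] += 1

    def norm(self, day):
        """Average for this weekday, or None if nothing has been recorded."""
        index = day.weekday()
        if self.counts[index] == 0:
            return None
        return self.totals[index] / self.counts[index]

    def is_abnormal(self, day, errors):
        """True if errors exceed the norm for this weekday."""
        expected = self.norm(day)
        return expected is not None and errors > expected
```

In practice you would probably want some margin on top of the weekday norm as well, otherwise roughly half of all days will look abnormal.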

In your situation, looking for hard downs on your connectivity, you
would see successful queries drop to 0 (or near 0) and your errors ramp
up.  That wouldn't be a hard one to detect programmatically.
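
As a rough illustration of that check, here is a sketch in Python.  The class name and all three limits are assumptions, and the inputs are assumed to be per-interval counts.  Requiring a couple of consecutive bad intervals is my own addition to avoid flapping on a single odd sample.

```python
class HardDownDetector:
    """Flag an outage when successes collapse while errors pile up."""

    def __init__(self, max_success=5, min_errors=20, consecutive=2):
        # All three limits are illustrative; tune them to your own traffic.
        self.max_success = max_success
        self.min_errors = min_errors
        self.consecutive = consecutive
        self.streak = 0

    def observe(self, successes, errors):
        """Feed one polling interval; True once the outage condition holds
        for the required number of consecutive intervals."""
        if successes <= self.max_success and errors >= self.min_errors:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.consecutive
```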

The other nice thing about putting this all into a DB is that you can
look back and get historical stats quite easily.
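
For example, a small sketch using SQLite from Python.  The table name, column names and helper functions are hypothetical, not what we actually run.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the stats database; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dns_stats ("
        " ts INTEGER PRIMARY KEY,"      # Unix time of the sample
        " successes INTEGER NOT NULL,"  # successful answers in the interval
        " errors INTEGER NOT NULL)"     # errors in the interval
    )
    return conn


def record(conn, ts, successes, errors):
    """Store one polling interval; the with-block commits the insert."""
    with conn:
        conn.execute(
            "INSERT INTO dns_stats (ts, successes, errors) VALUES (?, ?, ?)",
            (ts, successes, errors),
        )


def error_history(conn, start_ts, end_ts):
    """Return (ts, errors) rows with start_ts <= ts < end_ts, oldest first."""
    cur = conn.execute(
        "SELECT ts, errors FROM dns_stats"
        " WHERE ts >= ? AND ts < ? ORDER BY ts",
        (start_ts, end_ts),
    )
    return cur.fetchall()
```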

Look at tools like RRDtool/Cacti for graphing; we've been using Perl for
the monitoring stuff.
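
The parsing step itself could be sketched like this, in Python rather than Perl.  It assumes the statistics file that rndc stats appends to has dumps delimited by "+++ Statistics Dump +++" lines, "++ Section ++" headers and "<count> <description>" counter lines.  That layout differs between BIND versions, so check your own file first.  The function names are made up.

```python
import re

SECTION_RE = re.compile(r"^\+\+ (.+) \+\+$")
COUNTER_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")


def parse_last_dump(text):
    """Return {section: {counter: value}} for the last dump in the file.

    Assumed layout; adjust the patterns to what your named writes.
    """
    dump = {}
    section = None
    for line in text.splitlines():
        if line.startswith("+++ Statistics Dump +++"):
            # The file is appended to, so a new dump replaces the older one.
            dump, section = {}, None
            continue
        match = SECTION_RE.match(line)
        if match:
            section = match.group(1)
            dump[section] = {}
            continue
        match = COUNTER_RE.match(line)
        if match and section is not None:
            dump[section][match.group(2)] = int(match.group(1))
    return dump


def counter_deltas(previous, current):
    """Turn cumulative counters into per-interval counts.

    If a counter went backwards, assume named restarted and use the
    current value as the count for this interval.
    """
    deltas = {}
    for name, value in current.items():
        old = previous.get(name, 0)
        deltas[name] = value - old if value >= old else value
    return deltas
```

The deltas, not the raw counters, are what you would feed into averages and alerts, since the counters only ever grow while named is running.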

It's not quite as simple as looking for log lines, but it's all pretty
easy overall, and it has some nice bonuses.

Cheers,

Todd.

-----Original Message-----
From: bind-users-boun...@lists.isc.org
[mailto:bind-users-boun...@lists.isc.org] On Behalf Of Martin McCormick
Sent: Friday, July 17, 2009 9:20 AM
To: bind-us...@isc.org
Subject: Bind9.5.1 under no Root Name Servers

What does bind9.5.1 do when there is an Internet issue and we
lose all root name servers?

The bind9.3.x we had been running always began producing
tons of lines saying that there were no more recursive clients. I
had written a program that looked for the time stamp when the
mess starts and then for the time stamp of the last distress
call and we called that an outage since bind certainly wasn't
happy.

We had a very brief outage on the day we switched to
bind9.5.1 and I saw nothing remarkable in the named.log file
during the period where we lost all roots. Either bind9.5.1
doesn't produce this message or the hit just didn't last long
enough for all the recursive slots to fill up.

We do allow recursion from within our network but
disallow it for third parties.

Bind is an excellent place to take the pulse of one's
whole network since it is so closely tied to everything else.

Here is an actual example of the message we look for:

08-Jul-2009 08:38:20.296 client 139.78.102.224#53631:
 no more recursive clients: quota reached

Martin McCormick WB5AGZ  Stillwater, OK 
Systems Engineer
OSU Information Technology Department Telecommunications Services Group
___
bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users



Re: Bind9.5.1 under no Root Name Servers

2009-07-17 Thread Mark Andrews

BIND 9.5.1 does a better job than 9.3.x of shedding load when the
nameservers for a query are unreachable.
BIND 9.5.1 also detects duplicate queries and drops them;
BIND 9.3.x doesn't.  Both of these help prevent the
recursive quota from being reached.

BIND 9.5.1 will only allow so many queries for a given
qname,qtype,qclass to queue; after that it will just drop
new queries (SERVFAIL for TCP).  This pushes the queuing back
into the clients.  The number of queries allowed to queue auto-tunes
and ranges between 10 and 100 clients per query in a default
configuration.
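
For reference, the named.conf options involved look like this.  The values shown are the documented defaults as I understand them; check the ARM for your version before relying on them.

```
options {
    recursive-clients 1000;      // overall quota behind "no more recursive clients"
    clients-per-query 10;        // initial per-query limit
    max-clients-per-query 100;   // ceiling for the auto-tuned limit
};
```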

Mark

-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: ma...@isc.org