In message <4de9045c.2050...@anl.gov>, Barry Finkel writes:
> I have a problem with BIND 9.7.x on Ubuntu.
> I have two servers that are running 9.7.3.
> They slave 332 zones, and they also master 213,750
> malware/spyware zones that we have defined to reroute these
> domains to a local machine.
>
> When I was upgrading the BIND to 9.7.3-P1 yesterday, an
>
> ./rndc stop
>
> command ran over 8 minutes, and named did not stop.
> A "kill" command did not work; I had to resort to a
> "kill -9" command. What was BIND doing? Gracefully
> closing all of the zones?
Most probably. "rndc stop" makes sure the master files are up to date
before exiting; "rndc halt" exits without trying to flush them.
There could also have been a reference leak preventing named from
stopping.
> BIND 9.7.3-P1 came up fine, but there are two things that concern me:
>
> 1) After BIND began responding to queries, it was using
> 100% of the CPU for about three minutes. I am not sure what
> BIND was doing. This is not major because BIND was handling
> customer queries, and after the three minutes the CPU usage
> dropped to a normal 1%.
>
> 2) Two zones reported serial number decreases. This is bad.
>
> I did some research on the two zones - both Microsoft
> Active Directory zones (one _tcp and one _udp) that are mastered
> on a Windows Domain Controller and slaved on my BIND boxes.
> I have around 44 AD zones I slave, and only these two reported
> problems - on my two internal Ubuntu slaves and my two Solaris 10
> slaves. The two Solaris 10 slaves do not run the spyware zones,
> so I had no problem with "./rndc stop". I therefore am not sure
> that the serial number problems are due to the "kill -9".
They shouldn't be. The handling of master files and journals is
designed to survive having the power pulled at any time, provided the
filesystem supports atomic replacement of files.
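
To illustrate what "atomic replacement of files" means here, a minimal
Python sketch of the general write-then-rename pattern follows. This is
not named's actual code; the function name atomic_write and the ".tmp-"
prefix are made up for the example, and it assumes a POSIX filesystem
where a rename within one directory is atomic.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data to path so a reader sees either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the new contents to disk first
        os.replace(tmp_path, path)  # rename over the old file in one step
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process is killed before the rename, the old file is untouched;
if it is killed after, the new file is complete. A leftover temporary
file is the worst case.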
> I looked at the serial number issue on these two zones in detail;
> I capture the serial numbers on all the AD zones each morning at
> 6:10. Here is information for the _tcp zone:
>
> Date Zone Mast Slav Slav
> 20 Oct 2010 _tcp. 1233 1233 1233
> 21 Oct 2010 _tcp. 1239 1239 1239 The master incremented the serial.
> ...
> 09 Nov 2010 _tcp. 1239 1239 1239
> 10 Nov 2010 _tcp. 1238 1239 1239 Master decreased due to MS patch
> 11 Nov 2010 _tcp. 1238 1238 1238
> ...
> 03 Dec 2010 _tcp. 1238 1238 1238
> 04 Dec 2010 _tcp. 1238 1238 1239 ??
> 05 Dec 2010 _tcp. 1238 1239 1238 ??
> 06 Dec 2010 _tcp. 1238 1238 1238
> ...
> 09 Dec 2010 _tcp. 1238 1238 1238
> 10 Dec 2010 _tcp. 1238 1238 1239 ??
> 11 Dec 2010 _tcp. 1238 1239 1238 ??
> 12 Dec 2010 _tcp. 1238 1238 1238
> ...
> 05 Jan 2011 _tcp. 1238 1238 1238
> 06 Jan 2011 _tcp. 1238 1239 1239 ??
> 07 Jan 2011 _tcp. 1238 1238 1238
> ...
> 02 Mar 2011 _tcp. 1238 1238 1238 Upgrade 9.7.2-P3 to 9.7.3
> 03 Mar 2011 _tcp. 1238 1239 1239
> 04 Mar 2011 _tcp. 1238 1238 1238
> ...
> 16 Apr 2011 _tcp. 1238 1238 1238
> 17 Apr 2011 _tcp. 1238 1238 1238 1238 1238 Two Sol10 slaves added.
> ...
> 02 Jun 2011 _tcp. 1238 1238 1238 1238 1238 Upgrade 9.7.3 to 9.7.3-P1
> 03 Jun 2011 _tcp. 1238 1239 1239 1239 1239
>
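
A report like the one above can be scanned mechanically for the days on
which a slave disagreed with the master. The following is only a rough
Python sketch under the assumption that each row has the form "DD Mon
YYYY zone master slave..." followed by an optional comment that does not
begin with a digit; the function name mismatched_days is made up for the
example.

```python
import re

# Date, zone name, then a run of whitespace-separated serial numbers.
ROW = re.compile(r"^(\d{2} \w{3} \d{4})\s+(\S+)\s+(\d+(?:\s+\d+)*)")


def mismatched_days(lines):
    """Return (date, zone, master, slaves) for rows where any slave differs."""
    result = []
    for line in lines:
        match = ROW.match(line.strip())
        if not match:
            continue  # skip "..." separators and anything unparseable
        date, zone, serials_text = match.groups()
        serials = [int(s) for s in serials_text.split()]
        master, slaves = serials[0], serials[1:]
        if any(s != master for s in slaves):
            result.append((date, zone, master, slaves))
    return result


report = [
    "09 Nov 2010 _tcp. 1239 1239 1239",
    "10 Nov 2010 _tcp. 1238 1239 1239 Master decreased due to MS patch",
    "11 Nov 2010 _tcp. 1238 1238 1238",
    "03 Jun 2011 _tcp. 1238 1239 1239 1239 1239",
]
for row in mismatched_days(report):
    print(row)
```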
> Both Ubuntu slaves have been up for 149 days (reboot around Jan 15).
> The zone serial was 1239 until a MS patch run on the Domain
> Controller decreased the serial by one on the evening of Nov 9.
> I did nothing to correct the problem; I waited for the two zones
> to expire, and then new zones were transferred from the Windows
> master server. The serial number was 1238 on the master and
> slaves. On a few days, the serial on the slaves increased
> by one, and I am not sure what happened on those days.
>
> On Mar 02 I upgraded BIND from 9.7.2-P3 to 9.7.3, and the
> serial numbers on the two upgraded BIND slaves reverted to the
> higher 1239 serial. Again, I did no fixup, and on Mar 04
> the serials were the same at the lower value. I think that the
> serial number decrease was temporary during the patch run.
> On Apr 17 I added the two Solaris 10 slaves to my morning report, and
> all five serials were constant at 1238 until I upgraded BIND Tuesday (on
> the Solaris 10 boxes) and yesterday (on the Ubuntu boxes). Immediately
> after the upgrade BIND reported the serial number problem on these two
> zones. The other AD zones have had no serial number problems.
>
> I have no idea why BIND would remember the increased 1239
> serial number, when the serial number for the zone has been constant
> at 1238 since Mar 04. I have to assume that between Mar 04 and
> Jun 03 BIND would have written the zone to disk, either in the
> base zone file or a .jnl file.
>
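
For reference, SOA serials are compared using RFC 1982 serial number
arithmetic on 32-bit values, which is why a master at 1238 looks "older"
to a slave still holding 1239. A minimal Python sketch of that
comparison follows; it is an illustration of the RFC's rule, not code
taken from named, and the helper name master_went_backwards is made up
for the example.

```python
SERIAL_BITS = 32
HALF = 1 << (SERIAL_BITS - 1)
MASK = (1 << SERIAL_BITS) - 1


def serial_lt(s1, s2):
    """True if s1 is less than s2 under RFC 1982 serial arithmetic."""
    s1 &= MASK
    s2 &= MASK
    return (s1 < s2 and s2 - s1 < HALF) or (s1 > s2 and s1 - s2 > HALF)


def master_went_backwards(slave_serial, master_serial):
    """True if the master's serial is behind the copy the slave holds."""
    return serial_lt(master_serial, slave_serial)


print(master_went_backwards(1239, 1238))  # slave has 1239, master offers 1238
```

Note that the comparison wraps around: 4294967295 counts as less than 0,
so a serial only "decreases" when it moves backwards by less than 2^31.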
> --
> Barry S. Finkel
> Computing and Information Systems Division
> Argon