Public bug reported:

Binary package hint: bind

Bind lived up to its name this morning.  I have a bind server that
effectively serves triple duty providing:

- Public zones
- Private zones (for the lan)
- Recursive lookups and caching for lan hosts

The first two are effectively the same thing except for some ACLs, but
that's really beside the point.  Anyway, for the sake of this example,
assume the public zones and private zones are all slave zones.  The
public zones load over public ip addresses, while the private (lan)
zones load over an ipsec connection to the master's network.

This morning the ipsec connection went away, as it occasionally does,
and shortly thereafter so did bind.  What really boggled me was that
named would become completely unresponsive, even after rebooting the
server, within a couple minutes of startup.  It would refuse to do any
lookups, would fail to resolve its own zones, and even failed to respond
to rndc for restarts.  Basically, I'd have to kill -9 it, then restart.
It would work for a minute or two, then re-hang.
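
For anyone trying to catch the moment named stops answering, a small probe along these lines could be run from a lan host. This is only a sketch: the function names, the default timeout, and the queried name are made up for illustration and are not part of bind or its tools. It sends one hand-built A query over UDP and reports whether any matching reply arrives in time.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet for an A record."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels ending with a zero byte.
    labels = [label.encode("ascii") for label in name.rstrip(".").split(".")]
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def server_responds(server, name, timeout=2.0, port=53):
    """Return True if the server answers the query within the timeout."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts and ICMP errors such as "connection refused".
            return False
    # A reply counts only if it echoes our query ID.
    return len(reply) >= 2 and reply[:2] == query[:2]
```

Calling server_responds("127.0.0.1", "example.com") in a loop with a timestamp would show roughly how long after startup the hang sets in.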

After poking around for some time, I broke out strace and had a look at
a few of the threads running under named.  One thread looked like it was
waiting for input from the master over the down ipsec connection.
Another appeared to be in a bit of an infinite loop of the following:

clock_gettime(CLOCK_REALTIME, {1198072309, 972541645}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1671, {0, 403081355}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 377730}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1)        = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 379388030}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1673, {0, 11452970}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 392700}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1)        = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 394254492}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1675, {0, 77563508}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 473689}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1)        = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 475455403}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1677, {0, 498233597}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 976660}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1)        = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 977375142}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1679, {0, 499284858}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072311, 478563}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1)        = 0
clock_gettime(CLOCK_REALTIME, {1198072311, 480351624}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1681, {0, 498211376}) = ? ERESTART_RESTARTBLOCK (To be restarted)
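
A note on reading the trace above: a FUTEX_WAIT call that was given a timeout and returns ETIMEDOUT normally just means the timed wait expired, and "Connection timed out" is only the generic text for that errno, not a statement about a network connection. So this thread may be an idle timer waking up on schedule rather than the part that is stuck. That reading is an assumption and has not been checked against the bind source. The sketch below is not bind code; it only shows the same wait-and-time-out shape with a condition variable that nobody signals, and timer_loop is a made-up name.

```python
import threading


def timer_loop(intervals):
    """Wait out each interval on a condition variable nobody signals.

    Returns, for each interval, whether the wait was ended by a
    notification (True) or simply ran out of time (False).
    """
    cond = threading.Condition()
    results = []
    for seconds in intervals:
        with cond:
            # wait() returns False when the timeout expires, which is the
            # normal outcome for an idle timer thread, not a failure.
            woken = cond.wait(timeout=seconds)
        results.append(woken)
    return results
```

If that reading is right, the thread waiting on the master is the more interesting one to trace.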


I took the hint and commented out the couple of private zones that
required the master over ipsec.  Following that, named has stayed up and
running as normal.  Apparently somewhere in the bind code, if it doesn't
hear back from a master it will literally wait forever and stop serving
all data.  This, imo, is not good.
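
To make the expectation concrete, here is a minimal sketch of the behaviour I would hope for: a read from the master that gives up after a deadline instead of blocking indefinitely. It is not how bind implements zone transfers; read_from_master, its parameters, and the default timeout are invented for illustration.

```python
import socket


def read_from_master(host, port, timeout=5.0, nbytes=512):
    """Try to read from a master, giving up after `timeout` seconds.

    Returns the bytes received, or None if the master is unreachable or
    stays silent for longer than the timeout.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # The timeout given to create_connection also applies to recv().
            return sock.recv(nbytes)
    except OSError:
        # Connection failures and timeouts both end up here.
        return None
```

With a bound like this, a silent master costs the caller at most the timeout, and the rest of the server can keep answering queries in the meantime.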

I also have a few additional observations:
- This is not the first time the ipsec connection has gone away, but it's the 
first time I've seen this.  It may also be the first time ipsec has been down 
since upgrading to edgy, so the problem may be new in bind 9.4.  It could also 
be a bizarre coincidence.
- The public zones, which resolve over public ip addresses, did not cause a 
failure even when their master was unreachable.  This leads me to believe that 
there is something about the way ipsec dealt with bind's queries that was 
creating the condition, but I still think it's a condition bind should be able 
to deal with.

** Affects: bind (Ubuntu)
     Importance: Undecided
         Status: New

** Description changed:

  Binary package hint: bind
  
  Bind lived up to its name this morning.  I have a bind server that
- effectively serves triple duty surving:
+ effectively serves triple duty providing:
  

-- 
loss of masters causing bind to become unresponsive
https://bugs.launchpad.net/bugs/177489
You received this bug notification because you are a member of Ubuntu
Bugs, which is the bug contact for Ubuntu.

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
