Public bug reported:

Binary package hint: bind

Bind lived up to its name this morning. I have a bind server that effectively serves triple duty providing:

- Public zones
- Private zones (for the lan)
- Recursive lookups and caching for lan hosts

The first two are effectively the same thing except for some ACLs, but that's really beside the point. Anyway, for the sake of this example, assume the public zones and private zones are all slave zones. The public zones load over public ip addresses, while the private (lan) zones load over an ipsec connection to the master's network.

This morning the ipsec connection went away, as it occasionally does, and shortly thereafter so did bind. What really boggled me was that named would become completely unresponsive, even after rebooting the server, within a couple minutes of startup. It would refuse to do any lookups, would fail to resolve its own zones, and even failed to respond to rndc for restarts. Basically, I'd have to kill -9 it, then restart. It would work for a minute or two, then re-hang.

After poking around for some time, I broke out strace and had a look at a few of the threads running under named. One thread looked like it was waiting for input from the master over the down ipsec connection.
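
For anyone trying to confirm the same condition, here is a minimal sketch of a bounded reachability check against a master, run from the slave. It only tests whether a TCP connection to port 53 (the port zone transfers use) opens within a timeout; it does not send a DNS query or attempt a transfer. The function name `master_reachable`, the default timeout, and the address in the usage comment are illustrative assumptions, not anything taken from this server's configuration.

```python
import socket


def master_reachable(host, port=53, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration only; it gives up after `timeout`
    seconds instead of blocking indefinitely.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, unreachable routes and timeouts all land here.
        return False


# Example call (192.0.2.10 is a documentation-only address standing in
# for the real master behind the ipsec tunnel):
#     master_reachable("192.0.2.10")
```

If this returns False for the masters behind the tunnel but True for the public ones, that at least matches the split described in this report.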

Another thread appeared to be in a bit of an infinite loop of the following:

clock_gettime(CLOCK_REALTIME, {1198072309, 972541645}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1671, {0, 403081355}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 377730}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 379388030}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1673, {0, 11452970}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 392700}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 394254492}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1675, {0, 77563508}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 473689}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 475455403}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1677, {0, 498233597}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 976660}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 977375142}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1679, {0, 499284858}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072311, 478563}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072311, 480351624}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1681, {0, 498211376}) = ? ERESTART_RESTARTBLOCK (To be restarted)

I took the hint and commented out the couple of private zones that required the master over ipsec. Following that, named has stayed up and running as normal.

Apparently somewhere in the bind code, if it doesn't hear back from a master it will literally wait forever and stop serving all data. This, imo, is not good.

I also have the following additional observations to add:

- This is not the first time the ipsec connection has gone away, but it's the first time I've seen this. It may also be the first time ipsec has been down since upgrading to edgy, so the problem may be new in bind 9.4.
  It could also be a bizarre coincidence.

- The public zones, which resolve over public ip addresses, did not cause a failure even when their master was unreachable. This leads me to believe that there is something about the way ipsec dealt with bind's queries that was creating the condition, but I still think it's a condition bind should be able to deal with.

** Affects: bind (Ubuntu)
     Importance: Undecided
         Status: New

** Description changed:

  Binary package hint: bind

  Bind lived up to its name this morning. I have a bind server that
- effectively serves triple duty surving:
+ effectively serves triple duty providing:

-- 
loss of masters causing bind to become unresponsive
https://bugs.launchpad.net/bugs/177489
You received this bug notification because you are a member of Ubuntu Bugs, which is the bug contact for Ubuntu.

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs