Karl Auer wrote:
>> The fail-over protocol does not work. Full-stop.
>
> Unless you come up with some very clever definition of "does not work",
> that's just plain wrong, Alan. It clearly *does* work, most of the time
> for most of the people, and has been doing so in enterprises large and
> small for many years.

It does something. But it doesn't meet the goal of reliability. My
issue is less with the protocol itself than with the belief that it
*does* work. My experience with it has been less than positive.

> The fact that I haven't had a serious failure in the last eight years or
> so is a pretty good indicator to me that the protocol is robust
> *enough*.

You've been lucky. See the RELNOTES file included with ISC DHCP for a
series of bug fixes to the protocol. Both the implementation and the
protocol design have been changed substantially to avoid issues seen by
real live people in the field.

> That is true of any protocol you care to name. It's also an unanswerable
> non-argument. Does inspection of the DHCP failover protocol reveal a
> theoretical failure mode to you?

Yes. A few quick tests demonstrated that failure. See earlier messages
in this thread.

> Or is it that the ISC DHCP implementation that has exhibited failures?

They've *admitted* failure. Publicly.

---
FAILOVER: As of version 3.0.4, ISC has included a fix for an insidious
bug in the failover implementation which, if left unchecked, could
result in tying up all leases in transitional states (such as released,
reset, or expired). The crux of the problem is the lack of
retransmission of leases that rest in these states. The only way to
solve this problem is to carry additional state on the lease data
structures to indicate acknowledgement state.
---

That doesn't inspire confidence. It's not just a bug, which even
FreeRADIUS has had from time to time. The entire design of the protocol
has mutated and changed based on discovery of something they missed...
YEARS after the protocol was implemented. See also the massive changes
in the protocol between 3.0 and 3.1.

It's just like the duplicate detection cache implemented in FreeRADIUS
nearly 10 years ago, and by myself in other servers before that.
Yet it was only with the recent publication of RFC 5080 that some major
commercial servers went "Oh, that's a good idea...", and implemented it.
Until they did, they were subject to a number of *known* problems.

> That's possible. That never occurred to me, because it is allegedly
> interoperable with ISC DHCP. I will ask!

http://www.tolly.net/ts/2008/Nominum/DHCP2.2/Tolly208319NominumDHCP.pdf

They might be inter-operable. The major performance difference between
the two proves to me that the protocol between the Nominum servers is
*not* the same as the one used between ISC servers, or between ISC and
Nominum.

i.e. ISC claims to implement the protocol. If its performance is so
much worse than Nominum's, then either (a) ISC didn't implement the
protocol as spec'd, or (b) Nominum didn't.

And much of the rest of the performance difference is due to ISC *not*
using simple algorithms like "dynamic hash tables".

It's almost like ISC is *deliberately* bad, to make people go to
Nominum. That's OK. It leaves a window of opportunity for me to create
a DHCP server that *isn't* deliberately bad.

> It almost always works. It works *by far* most of the time. Even with
> ISC DHCP. To the point where I have not ever seen it fail except due to
> bugs in an implementation. My experience is not all-encompassing -
> perhaps you have seen it fail when the protocol was properly
> implemented.

Yes.

> Yes. Or rather, it's delays in the operation of the failover protocol as
> implemented in ISC DHCP. Or I believe it to be - feel free to educate me
> otherwise.

I really don't know. I'm happy to say that both the protocol and the
implementation are "less than optimal".

>> And I'll bet money that Nominum is getting such high performance by
>> doing the kind of optimizations I'm talking about.
>
> That could be. That is, their failover implementation may not follow the
> draft standard. However, if they were going to go non-standard, why not
> develop their own mechanism entirely?
> But I will ask them about this.

I'm sure that they developed their own standard for communication
between Nominum servers.

Alan DeKok.