I am writing to reply to the previous comments:

1. I have no comment on whether the failure and the response to it were "professional."

2. The failure was directly caused by a double hardware failure -- a problem in Boulder and an unrelated problem in Fort Collins at essentially the same time. Now that it has occurred, it is trivially easy to design an algorithm that would have detected this exact problem, but it would have been much more difficult to do so beforehand. The fact that the system did not have this particular algorithm to deal with this particular double failure is not a bug in the usual sense of that word.

3. We have a lot of safeguards built into the systems and we have a lot of experience running them. We have been running time servers for about 20 years, and I can't remember the last failure of this magnitude. I have made some changes to deal with the relative unreliability of the backup systems in Fort Collins, and this particular problem will not happen again. But it would be foolish of me to promise that the system is perfect and that some other failure, *with equally serious impact*, will never happen. We have more than 100 computers in the network and a lot of ancillary equipment, and it would be foolish and simplistic of me to guarantee that I (or anyone else) have thought of every possible hardware failure.

4. The "unhealthy" flag in NTP (both leap second bits set) is a copy of an internal private kernel parameter. This parameter can be set by a number of internal check processes (which are outside of NTP and independent of it), and it can also be set from Boulder if the central controller detects a problem. A complete failure of the ACTS system would have set the unhealthy flag unconditionally, but a partial failure like the one that actually occurred may not do so. The same kernel parameter is used to control the status parameters of the other non-NTP services that we provide.

5. Since hardware failures are probably inevitable in a network of the size and complexity of the NIST service, a fair question is whether a failure can be limited and its impact contained, or ideally made invisible to the users. The failure affected 11 of the 35 physical time servers that I operate. So the glass starts out about 1/3 empty and 2/3 full. About half of the 11 physical servers were transmitting the unhealthy status and should not have caused any problems for users who parse the flags. So, even during the worst failure in my memory, the glass is about 1/7 empty and 6/7 full.

Judah Levine
Time and Frequency Division
NIST Boulder
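
For readers who want to check the flag discussed in item 4 on the client side, here is a minimal sketch in Python. It assumes only the standard NTP header layout, in which the leap indicator is the top two bits of the first byte and the value 3 (both bits set) marks the server as unsynchronized. The names leap_indicator, is_unhealthy, and LI_ALARM are made up for this illustration, and the two sample packets are built by hand rather than captured from a NIST server.

```python
# Hypothetical helper names; only the header layout comes from the NTP spec.
LI_ALARM = 3  # both leap-indicator bits set: server clock not synchronized


def leap_indicator(packet: bytes) -> int:
    """Return the 2-bit leap indicator from the first byte of an NTP packet."""
    if len(packet) < 48:
        raise ValueError("an NTP packet header is at least 48 bytes")
    return (packet[0] >> 6) & 0b11


def is_unhealthy(packet: bytes) -> bool:
    """True when the server has flagged its own time as unusable."""
    return leap_indicator(packet) == LI_ALARM


# Hand-built 48-byte headers for illustration (version 4, mode 4 = server).
healthy = bytes([0x24]) + bytes(47)  # 0b00_100_100: leap indicator 0
flagged = bytes([0xE4]) + bytes(47)  # 0b11_100_100: leap indicator 3

print(is_unhealthy(healthy), is_unhealthy(flagged))
```

A client that makes this check can discard replies from a flagged server and fall back to another one, which is the behavior that "users who parse the flags" refers to.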