I know this isn’t a question ABOUT BIND, per se, but I think it is still a 
question bind-users might have an answer to. I’ve seen various failover 
questions on the list, but nothing that talks specifically about NS records (at 
least nothing in the last decade), so I thought I’d inquire here.


I’m familiar with round-robin DNS and using multiple A records for the same 
name. I also understand that most clients, if the first server on the list 
doesn’t respond, will wait ~30 seconds before trying the next address on the 
list. This is pretty good, as far as automatic failover goes, but having X% of 
your users (X being down servers / all A records offered) wait an extra 30 
seconds is not great. So I’m going to run a regular health check on my 
front-facing web servers from each BIND server. If a web server stops 
responding, I’ll change my zone file and reload; once it starts responding 
again, I’ll reverse the process. Then X% of my users will only hit that 
30-second wait until I fix the zone file. The TTL will be about the same as 
the health check interval, so the worst case is roughly 2 x TTL during which 
X% of users have to wait those extra 30 seconds. Overall I’m satisfied with 
this balance between complexity and resiliency, particularly since I can 
manipulate records in advance of planned maintenance, so this problem only 
arises during unexpected outages.
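
To make the idea concrete, here is a minimal sketch of the health-check step, 
assuming a plain TCP connect on port 80 counts as "responding". The function 
names, the "www" label, the 60-second TTL and the 192.0.2.x addresses are all 
placeholders for illustration, not my real setup.

```python
import socket


def is_up(ip, port=80, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def render_a_records(name, ips, ttl, check=is_up):
    """Return zone-file A record lines for the servers that pass the check.

    `check` is injectable so the logic can be exercised without a network.
    """
    healthy = [ip for ip in ips if check(ip)]
    # Assumption: if every check fails, the checker itself may be at fault,
    # so publish the full set rather than an empty one.
    if not healthy:
        healthy = list(ips)
    return [f"{name} {ttl} IN A {ip}" for ip in healthy]


if __name__ == "__main__":
    # Hypothetical front-end addresses (documentation range).
    for line in render_a_records("www", ["192.0.2.10", "192.0.2.11"], 60):
        print(line)
```

The output would still have to be written into the zone file, with the SOA 
serial incremented, before running rndc reload. That part is left out here.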


This is all well and good until I think about failure or maintenance of the 
name servers themselves. I’ll need to give my registrar the NS IPs for my 
domain, but changes made through the registrar will not be nearly as flexible 
as changes I make on my own name servers (the TTL will probably be an hour, at 
the very least). That makes set-up and tear-down for maintenance a MUCH longer 
process if I have to coordinate NS record changes with my registrar. However, 
this made me wonder: is NS failure handled in the same way as the failure of 
an A record? Various Internet randos have indicated that some DNS clients and 
resolvers do parallel lookups and take the first response. Others have 
indicated that the “try the next record” timeout for NS comms is 5 to 10 
seconds rather than 30, and still others claim it’s the same as A record 
failover, at 30 seconds before trying the next candidate on the list. Is there 
a definitive answer to this or, because it’s client related, are the answers 
too widely varied to rely upon (which would explain why the answers on the 
Internet are all over the map)?
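
Short of a definitive answer, I could at least measure what a given client 
sees. Below is a rough sketch, assuming a test zone where one of the listed 
name servers is deliberately down and a name that isn't already cached. The 
function name and the test name are made up, and the result only describes 
whichever stub and recursive resolver the machine happens to use.

```python
import socket
import time


def time_lookup(name, resolve=socket.getaddrinfo):
    """Return (elapsed_seconds, succeeded) for a single lookup of `name`.

    `resolve` is injectable so the timing logic can be tested offline.
    """
    start = time.monotonic()
    try:
        resolve(name, 80)
        ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ok = False
    return time.monotonic() - start, ok


if __name__ == "__main__":
    # Hypothetical name in a test zone with one name server taken down.
    elapsed, ok = time_lookup("failover-test.example.com")
    print(f"resolved={ok} in {elapsed:.2f}s")
```

The first, uncached lookup is the interesting number here; repeated runs 
against the same name would mostly be measuring the cache.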


Failures aside, I’m worried about creating a bad user experience EVERY time I 
need to take a DNS server down for patching. I can’t be the first person to run 
into this problem. Is it just something people live with (shuffling NS 
records around all the time), or is NS failover really smoother than A record 
failover, in which case I should concentrate on keeping my A records current 
for failures AND planned maintenance?


Any feedback would be greatly appreciated.


Thanks,


Scott
