Re: ultradns reachability
In a message written on Fri, Jul 02, 2004 at 05:55:13PM -0700, Matt Ghali wrote:

> DNS traffic, surprisingly, is not very fat. It is nothing like HTTP or
> SMTP. The engineering behind appropriately sizing a unicast fallback
> would be pretty trivial, especially compared to building a
> somewhat-robust anycast architecture.

This statement may be true for many DNS servers, but I suspect it is completely false for the roots, or for the GTLDs. Perhaps the folks from .org or from f-root would like to comment on how hard it would be to handle the whole load from a single box, particularly when you consider they are all high-profile DDoS targets as well. If it were trivial, more GTLDs would be doing it.

-- 
Leo Bicknell - [EMAIL PROTECTED] - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/
Read TMBG List - [EMAIL PROTECTED], www.tmbg.org
Re: ultradns reachability
On Fri, 2 Jul 2004, Stephen J. Wilcox wrote:

> 10.1.0.1 Anycast1 (x50 boxes)
> 10.2.0.1 Anycast2 (x50 boxes - different to anycast1)
>
> In each scenario two systems have to fail to take out any one
> customer.. but isn't the bottom one better for the usual pro-anycast
> reasons?

Correct, and that's what's done whenever engineering triumphs over marketing. The problem is that there's always a temptation to put instances of both clouds at a single physical location, but that's sabotaging yourself, since then the attack which takes down one will take down the other as well.

With DNS, it really makes sense to do what you're suggesting, since DNS has its own internal load-balancing function, and having two separate clouds just means that you're giving both the anycast and the DNS client load-balancing algorithms a chance to work. With pretty much any other protocol (except peer-to-peer clients, which also mostly do client-side load balancing) there's a big temptation to have a single huge cloud that appears in as many places as possible.

-Bill
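Bill's co-location point can be made concrete with a toy availability model. This is only an illustrative sketch with an assumed per-site outage probability, not a measurement of any real deployment: when instances of both clouds share a site, one attack on that site takes out both clouds at once, so the "two independent failures" protection evaporates.

```python
# Toy availability model (assumed numbers, purely illustrative):
# two anycast clouds whose instances are either at separate sites
# (independent failures) or co-located (one event downs both).

def p_both_down(p_site_down: float, co_located: bool) -> float:
    """Probability that a client loses both clouds simultaneously."""
    if co_located:
        # A single attack/outage at the shared site kills both clouds.
        return p_site_down
    # Separate sites must fail independently at the same time.
    return p_site_down * p_site_down

p = 0.01  # assumed chance a given site is down or under attack
independent = p_both_down(p, co_located=False)  # roughly p squared
colocated = p_both_down(p, co_located=True)     # just p -- 100x worse here
print(independent, colocated)
```

Under these assumed numbers, co-locating the clouds makes the double-failure event two orders of magnitude more likely, which is the self-sabotage Bill describes.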
Re: ultradns reachability
On 2 Jul 2004, at 00:18, Christopher L. Morrow wrote:

> So, I thought of it like this:
>
> 1) Rodney/Centergate/UltraDNS knows where all their 35000billion
>    copies of the 2 .org TLD boxes are, what network pieces they are
>    connected to at which bandwidths and the current utilization
> 2) Rodney/Centergate/UltraDNS knows which boxes in each location
>    (there could be multiple inside each pod, right?) are running their
>    dns process and answering at which rates
> 3) Rodney/Centergate/UltraDNS knows when processes die and locally
>    stop pushing requests to said system inside the pod
> 4) Rodney/Centergate/UltraDNS knows when a pod is completely down (no
>    systems responding inside the local pod) so they can stop routing
>    the /24 from that pod's location
>
> So, Rodney/Centergate/UltraDNS should know almost exactly when they
> have a problem they can term 'critical'... I most probably left out
> some steps above, like wedged processes or loss of outbound routing to
> prefixes sending requests. I'm sure Paul/ISC has a fairly complete
> list of failure modes for anycast DNS services.

All the failure modes that ISC has seen with anycast nameserver instances can be avoided (for the authoritative DNS service as a whole) by including one or more non-anycast nameservers in the NS set. This leaves the anycast servers providing all the optimisation that they are good for (local nameserver in topologically distant networks; distributed DDoS traffic sink; reduced transaction RTT) and provides a fall-back in case of effective reachability problems for the anycast nameservers. This is so trivial, I continue to be amazed that PIR hasn't done it.

> The problem then becomes the "Hey, .org is dead! From where is it
> dead? What pod are you seeing it dead from? Is it routing TO the pod
> from you? FROM the pod to you? The pod itself? Stuck/stale routing
> information somewhere on the path(s)?" This is very complex, or seems
> to be to me :(

With the fix above, the problem becomes "hey, *some* of the nameservers for ORG are dead! We should fix that, but since not *all* of them are dead, at least ORG still works."

> I think more failure modes will be investigated before that comes :)
> fortunately lots of people are already investigating these, eh?

I don't know about lots, but I know of a few. None of the people I know of are using an entire production TLD as their test-bed, however.

Joe
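Joe's mixed-NS-set argument reduces to a simple property: the zone stays resolvable as long as any one server in the NS set answers. A minimal sketch (hypothetical server names; real resolvers retry and reorder, but the survivability property is the same):

```python
# Sketch of why one non-anycast server in the NS set keeps a zone alive.
# Server names are hypothetical; 'reachable' simulates an outage in which
# both anycast clouds are unreachable from this client's vantage point.

def first_responding(ns_set, reachable):
    """Return the first nameserver in the NS set that answers, or None."""
    for ns in ns_set:
        if reachable.get(ns, False):
            return ns
    return None

ns_set = ["anycast1.example", "anycast2.example", "unicast1.example"]
reachable = {
    "anycast1.example": False,  # anycast cloud 1 down/black-holed
    "anycast2.example": False,  # anycast cloud 2 down/black-holed
    "unicast1.example": True,   # plain unicast server still up
}

# The zone still resolves: the unicast fall-back answers.
print(first_responding(ns_set, reachable))
```

If the unicast server is removed from the set, the same outage takes the whole zone down, which is exactly the "*all* of them are dead" case Joe contrasts against.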
Re: ultradns reachability
In a message written on Fri, Jul 02, 2004 at 10:22:09AM -0400, Joe Abley wrote:

> This leaves the anycast servers providing all the optimisation that
> they are good for (local nameserver in topologically distant networks;
> distributed DDoS traffic sink; reduced transaction RTT) and provides a
> fall-back in case of effective reachability problems for the anycast
> nameservers. This is so trivial, I continue to be amazed that PIR
> hasn't done it.

I talked to Rodney about this a long time ago, as well as a few other people. What in practice seems simple is complicated by some of the software that is out there. See:

http://www.nanog.org/mtg-0310/pdf/wessels.pdf

Note in the later pages what happens to particular servers under packet loss. They all start to show an affinity for a subset of the servers. It's been said that by putting some non-anycasted servers in with the anycasted servers, what can happen is that if the anycast has issues, many things will latch on to the non-anycasted servers and not go back even when the anycast is fixed.

How serious this is for something like .org I have no idea, but it's clear all the software has issues, and until they are fixed I don't think this is just a slam dunk.

-- 
Leo Bicknell - [EMAIL PROTECTED] - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/
Read TMBG List - [EMAIL PROTECTED], www.tmbg.org
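The latch-on behaviour Leo describes can be sketched with a toy smoothed-RTT server-selection loop. The penalty and decay constants below are invented for illustration, not any particular resolver's actual algorithm: the point is only that if a server's RTT estimate is inflated by timeouts and is never re-probed, the resolver sticks with the other server even after the outage ends.

```python
import random

def pick_server(srtt):
    """Choose the server with the lowest smoothed-RTT estimate."""
    return min(srtt, key=srtt.get)

def query(server, lossy, rng):
    """Simulated query: lossy servers time out 60% of the time."""
    if lossy.get(server) and rng.random() < 0.6:
        return None   # timeout
    return 0.05       # 50 ms answer

rng = random.Random(42)
srtt = {"anycast": 0.05, "unicast": 0.05}   # both look equal at first
lossy = {"anycast": True, "unicast": False}  # anycast cloud is unhealthy

picks_after_recovery = {"anycast": 0, "unicast": 0}
for i in range(2000):
    if i == 1000:
        lossy["anycast"] = False        # anycast "fixed" halfway through
    s = pick_server(srtt)
    rtt = query(s, lossy, rng)
    if rtt is None:
        srtt[s] = srtt[s] * 2 + 1.0     # heavy timeout penalty
    else:
        srtt[s] = 0.7 * srtt[s] + 0.3 * rtt
    if i >= 1000:
        picks_after_recovery[s] += 1

# The stale, inflated estimate for the anycast server is never refreshed,
# so the resolver never goes back to it -- the affinity Leo mentions.
print(picks_after_recovery)
```

A resolver that periodically re-probes "bad" servers would escape this trap; the slide deck Leo cites shows that much of the deployed software at the time did not.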
Re: ultradns reachability
On Fri, 2 Jul 2004 10:22:09 -0400, Joe Abley wrote:

> With the fix above, the problem becomes "hey, *some* of the
> nameservers for ORG are dead! We should fix that, but since not *all*
> of them are dead, at least ORG still works."

Sorry, I missed the top of this thread. I cannot mail an ORG correspondent since ORG domain lookups fail. What is happening, and is there a workaround?

Thanks for any help,

Jeffrey Race
Re: ultradns reachability
On 2 Jul 2004, at 10:43, Leo Bicknell wrote:

> Note in the later pages what happens to particular servers under
> packet loss. They all start to show an affinity for a subset of the
> servers. It's been said that by putting some non-anycasted servers in
> with the anycasted servers what can happen is if the anycast has
> issues many things will latch on to the non-anycasted servers and not
> go back even when the anycast is fixed.

In my opinion, the primary purpose of anycast distribution of nameservers is reliability of the service as a whole, and not performance. Being able to reach a server is much more important than whether you can get a reply from a particular server in 10ms or 500ms.

So, I think the issue you mention (which is certainly mention-worthy) is a much smaller problem than the apparently observed problem of all nameservers in the NS set being unavailable.

Joe
Re: ultradns reachability
On Fri, 2 Jul 2004, Joe Abley wrote:

> All the failure modes that ISC has seen with anycast nameserver
> instances can be avoided (for the authoritative DNS service as a
> whole) by including one or more non-anycast nameservers in the NS set.

Am I missing something.. So you say:

10.1.0.1 Anycast (x50 boxes)
10.2.0.1 Non-anycast

is somehow different from

10.1.0.1 Anycast1 (x50 boxes)
10.2.0.1 Anycast2 (x50 boxes - different to anycast1)

In each scenario two systems have to fail to take out any one customer.. but isn't the bottom one better for the usual pro-anycast reasons?

Steve
Re: ultradns reachability
On Fri, 2 Jul 2004, Leo Bicknell wrote:

> So the question is not so much is 500ms towards the server bad, it's
> can I build a single server (cluster) that will take all the load
> worldwide when the client software does bad things.

DNS traffic, surprisingly, is not very fat. It is nothing like HTTP or SMTP. The engineering behind appropriately sizing a unicast fallback would be pretty trivial, especially compared to building a somewhat-robust anycast architecture.

matto
ultradns reachability
is anyone else seeing timeouts reaching ultradns' .org nameservers? I'm seeing seemingly random timeout failures from both sbci and uc berkeley.
RE: ultradns reachability
Well, http://www.dnsstuff.com is showing it also (http://www.dnsstuff.com/tools/lookup.ch?name=mariners.org&type=A).

How I am searching:

Searching for A record for mariners.org at j.root-servers.net: Got referral to TLD2.ULTRADNS.NET. [took 93 ms]
Searching for A record for mariners.org at TLD2.ULTRADNS.NET.: Timed out. Trying again.
Searching for A record for mariners.org at TLD2.ULTRADNS.NET.: Timed out. Trying again.
Searching for A record for mariners.org at TLD2.ULTRADNS.NET.: Timed out. Trying again.
Searching for A record for mariners.org at TLD1.ULTRADNS.NET.: Timed out. Trying again.
Searching for A record for mariners.org at TLD2.ULTRADNS.NET.: Got referral to NS2.DIGISLE.NET. [took 473 ms]
Searching for A record for mariners.org at NS2.DIGISLE.NET.: Reports mariners.org. [took 49 ms]

Answer:

Domain            Type  Class  TTL     Answer
mariners.org.     A     IN     3600    216.74.142.14
mariners.org.     NS    IN     3600    ns1.digisle.net.
mariners.org.     NS    IN     3600    ns2.digisle.net.
ns1.digisle.net.  A     IN     172800  167.216.250.42
ns2.digisle.net.  A     IN     172800  167.216.193.233

-C

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Matt Ghali
Sent: Thursday, July 01, 2004 7:01 PM
To: [EMAIL PROTECTED]
Subject: ultradns reachability

is anyone else seeing timeouts reaching ultradns' .org nameservers? I'm seeing seemingly random timeout failures from both sbci and uc berkeley.
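The trace above is just an iterative lookup cycling through the TLD servers until one answers. A minimal sketch of that retry behaviour (hypothetical attempt schedule and outage model, not dnsstuff's actual code):

```python
# Toy model of the retry pattern in the trace: alternate between the two
# TLD servers until one responds. 'up_after' maps a server to the attempt
# index at which it starts answering (simulating the outage ending).

def resolve(servers, up_after, max_attempts=8):
    """Round-robin through servers; return (server, attempt) on success."""
    for attempt in range(max_attempts):
        server = servers[attempt % len(servers)]
        if attempt >= up_after.get(server, 0):
            return server, attempt
    return None, max_attempts

servers = ["TLD2.ULTRADNS.NET.", "TLD1.ULTRADNS.NET."]
# Hypothetical: both servers time out at first, TLD2 recovers first.
up_after = {"TLD2.ULTRADNS.NET.": 4, "TLD1.ULTRADNS.NET.": 5}

server, attempt = resolve(servers, up_after)
print(server, attempt)
```

With these assumed numbers the lookup succeeds on the fifth attempt against TLD2, mirroring the shape of the trace: several timeouts across both servers, then a late answer from the server that recovered.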
Re: ultradns reachability
Once upon a time, Matt Ghali [EMAIL PROTECTED] said:

> is anyone else seeing timeouts reaching ultradns' .org nameservers?
> I'm seeing seemingly random timeout failures from both sbci and uc
> berkeley.

One is working and one is not from here.

$ dig +norec @tld1.ultradns.net whoareyou.ultradns.net in a

; <<>> DiG 8.4 <<>> +norec @tld1.ultradns.net whoareyou.ultradns.net in a
; (1 server found)
;; res options: init defnam dnsrch
;; res_nsend: Connection timed out

$ dig +norec @tld2.ultradns.net whoareyou.ultradns.net in a

; <<>> DiG 8.4 <<>> +norec @tld2.ultradns.net whoareyou.ultradns.net in a
; (1 server found)
;; res options: init defnam dnsrch
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33271
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUERY SECTION:
;;      whoareyou.ultradns.net, type = A, class = IN

;; ANSWER SECTION:
whoareyou.ultradns.net.  0S IN A  204.74.105.6

;; Total query time: 403 msec
;; FROM: ant.hiwaay.net to SERVER: 204.74.113.1
;; WHEN: Thu Jul 1 20:10:28 2004
;; MSG SIZE  sent: 40  rcvd: 56

$

-- 
Chris Adams [EMAIL PROTECTED]
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.
Re: ultradns reachability
Yes, it looks like it is starting to get back to normal since I got your email :) As far as I could tell, it started around 5:30 PST and ended around 6:00 PST.

Thanks,
Eric

At 06:01 PM 7/1/2004, Matt Ghali wrote:

> is anyone else seeing timeouts reaching ultradns' .org nameservers?
> I'm seeing seemingly random timeout failures from both sbci and uc
> berkeley.
Re: ultradns reachability
http://www.cymru.com/DNS/gtlddns-o.html
Re: ultradns reachability
On Thu, 1 Jul 2004, James Edwards wrote:

> http://www.cymru.com/DNS/gtlddns-o.html

my mrtg skillz are kind of lame, but this seems to show a 2/3rds outage from this monitoring point of view. It'd be nice if the aforementioned 'what/where/who' info was available for each monitoring point CYMRU uses... So you could tell that from the SBC POV you were querying the XO west-coast pod, from the APPS POV you saw the Verio CHI pod, and from the ATT POV you saw the ATT local pod.

Anycast makes the pinpointing of problems a little challenging from the external perspective, it seems to me.
Re: ultradns reachability
On Fri, Jul 02, 2004 at 02:06:59AM +0000, Christopher L. Morrow wrote:

> On Thu, 1 Jul 2004, James Edwards wrote:
>
> > http://www.cymru.com/DNS/gtlddns-o.html
>
> my mrtg skillz are kind of lame, but this seems to show 2/3rds outage
> from this monitoring point of view. It'd be nice if the aforementioned
> 'what/where/who' info was available for each monitoring point CYMRU
> uses... So you could tell that from the SBC POV you were querying the
> XO westcoast pod, from the APPS POV you saw the Verio CHI pod and from
> the ATT POV you saw the ATT local pod. Anycast makes the pinpointing
> of problems a little challenging from the external perspective it
> seems to me.

i am relieved it is only 'a little challenging' because i was worried it was 'sub-possible'. (or am i misinterpreting operational euphemisms...)

if we use the routing system to hide reality, we give up transparency in exchange for vigor. it's unclear to me that we even know how to quantify, much less measure, that tradeoff. like so many other complexity tradeoffs..

but then we've taken similar risks before and gotten stuff like BGP so maybe we'll be, um, just as fond of anycast in due time. :)

k

// they call this war a cloud over the land. but they made the weather
// and then they stand in the rain and say 'sh*t, it's raining!'.
//   -- renée zellweger, 'cold mountain'
Re: ultradns reachability
On Thu, 1 Jul 2004, k claffy wrote:

> On Fri, Jul 02, 2004 at 02:06:59AM +0000, Christopher L. Morrow wrote:
> > On Thu, 1 Jul 2004, James Edwards wrote:
> > > http://www.cymru.com/DNS/gtlddns-o.html
> > Anycast makes the pinpointing of problems a little challenging from
> > the external perspective it seems to me.
>
> i am relieved it is only 'a little challenging' because i was worried
> it was 'sub-possible'. (or am i misinterpreting operational
> euphemisms...)

Oops, I did it again, I forgot the :). So, I thought of it like this:

1) Rodney/Centergate/UltraDNS knows where all their 35000billion copies of the 2 .org TLD boxes are, what network pieces they are connected to at which bandwidths and the current utilization
2) Rodney/Centergate/UltraDNS knows which boxes in each location (there could be multiple inside each pod, right?) are running their dns process and answering at which rates
3) Rodney/Centergate/UltraDNS knows when processes die and locally stop pushing requests to said system inside the pod
4) Rodney/Centergate/UltraDNS knows when a pod is completely down (no systems responding inside the local pod) so they can stop routing the /24 from that pod's location

So, Rodney/Centergate/UltraDNS should know almost exactly when they have a problem they can term 'critical'... I most probably left out some steps above, like wedged processes or loss of outbound routing to prefixes sending requests. I'm sure Paul/ISC has a fairly complete list of failure modes for anycast DNS services.

The problem then becomes the "Hey, .org is dead! From where is it dead? What pod are you seeing it dead from? Is it routing TO the pod from you? FROM the pod to you? The pod itself? Stuck/stale routing information somewhere on the path(s)?" This is very complex, or seems to be to me :(

A good thing, oddly enough, is each of these events gives everyone more and better information about the failure modes :)

> but then we've taken similar risks before and gotten stuff like BGP so
> maybe we'll be, um, just as fond of anycast in due time. :)

I think more failure modes will be investigated before that comes :) fortunately lots of people are already investigating these, eh?

-Chris
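Steps 2-4 in Chris's list amount to a simple health-check loop: drain queries away from dead processes inside a pod, and withdraw the pod's service /24 only when nothing in the pod answers at all. A sketch of that logic (hypothetical pod/host names; the route withdrawal is represented by a boolean, not an actual BGP action):

```python
# Sketch of the per-pod monitoring logic enumerated above.
# Host names are hypothetical; 'answering' simulates health-check results.

def drain_dead_processes(pod_hosts, answering):
    """Steps 2-3: only keep sending queries to hosts whose DNS process
    is still answering."""
    return [h for h in pod_hosts if answering.get(h, False)]

def pod_should_announce(pod_hosts, answering):
    """Step 4: keep announcing the service /24 from this pod only while
    at least one local host is answering."""
    return any(answering.get(h, False) for h in pod_hosts)

pod = ["dns1.pod-chi", "dns2.pod-chi"]
answering = {"dns1.pod-chi": False, "dns2.pod-chi": True}

# One process died: drain it locally, but keep the /24 announced.
print(drain_dead_processes(pod, answering), pod_should_announce(pod, answering))

# Whole pod dead: stop routing the /24 so queries fail over elsewhere.
answering["dns2.pod-chi"] = False
print(pod_should_announce(pod, answering))
```

This is exactly the information asymmetry in the thread: the operator can compute "critical" from this loop, while outside observers only see the routing-layer consequences.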
Re: ultradns reachability
CLM> Date: Fri, 02 Jul 2004 04:18:07 +0000 (GMT)
CLM> From: Christopher L. Morrow

[ edited for brevity -- some punctuation/wording modified ]

CLM> So, I thought of it like this. Rodney/Centergate/UltraDNS
CLM> knows: [ snip enumeration ]
CLM> [and] should know almost exactly when they have a problem
CLM> they can term 'critical'...

One essentially has a DNS network on top of an IP network. Looks like O(N) with centralized monitoring, although it could approach O(N^2) if each server/pod cross-monitored all the others.

CLM> The problem then becomes the Hey, .org is dead! From where
CLM> is it dead? What pod are you seeing it dead from? Is it
CLM> routing TO the pod from you? FROM the pod to you? The pod
CLM> itself? Stuck/stale routing information somewhere on the
CLM> path(s)? This is very complex, or seems to be to me :(

I find your perception of complexity ironic. Yes, there's a good deal of splay. However, I suspect a network the size of UU also has a fair amount of peering splay, with a couple downstreams thrown in for good measure. ;-)

However, I agree anycast has additional design implications:

* Should servers/pods talk among themselves using mcast along pairs that follow L3 topology? Should N servers/pods each communicate with (N / 2 + 1) others, ignoring L3 topology? Fast poll the former and slow poll the latter?
* If servers/pods communicate among themselves, should they use unicast addresses? anycast addresses? anycast addresses tunneled through unicast?
* Each pod a stub? Each pod interconnected with an OOB OAM network? All pods interconnected with a sizable backbone? Does multicast serve a purpose?

Eddy

-- 
EverQuick Internet - http://www.everquick.net/
A division of Brotsman & Dreger, Inc. - http://www.brotsman.com/
Bandwidth, consulting, e-commerce, hosting, and network building
Phone: +1 785 865 5885 Lawrence and [inter]national
Phone: +1 316 794 8922 Wichita
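Eddy's scaling point can be made concrete with a quick count of monitoring sessions under each scheme, using an assumed N=50 (matching the "x50 boxes" figure upthread):

```python
# Count health-check sessions for n pods under the monitoring schemes
# Eddy sketches. n=50 is an assumed figure for illustration.

def monitoring_sessions(n: int, scheme: str) -> int:
    """Number of distinct monitoring sessions, each counted once."""
    if scheme == "centralized":
        return n                        # one monitor polls every pod: O(N)
    if scheme == "full_mesh":
        return n * (n - 1) // 2         # every pod polls every other: O(N^2)
    if scheme == "partial_mesh":
        return n * (n // 2 + 1) // 2    # each pod polls (N/2 + 1) peers
    raise ValueError(f"unknown scheme: {scheme}")

for scheme in ("centralized", "partial_mesh", "full_mesh"):
    print(scheme, monitoring_sessions(50, scheme))
```

For 50 pods this gives 50, 650, and 1225 sessions respectively, which is why the choice between centralized monitoring and cross-monitoring is a real design decision rather than a detail.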