Re: ultradns reachability

2004-07-03 Thread Leo Bicknell
In a message written on Fri, Jul 02, 2004 at 05:55:13PM -0700, Matt Ghali wrote:
 DNS traffic, surprisingly, is not very fat. It is no HTTP nor SMTP.
 
 The engineering behind appropriately sizing a unicast fallback would
 be pretty trivial, especially compared to building a somewhat-robust
 anycast architecture.

This statement may be true for many DNS servers, but I suspect it
is completely false for the roots, or for the GTLDs.  Perhaps the
folks from .org or from f-root would like to comment on how hard
it would be to handle the whole load from a single box, particularly
when you consider they are all high profile DDoS targets as well.

If it were trivial, more GTLDs would be doing it.

-- 
   Leo Bicknell - [EMAIL PROTECTED] - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/
Read TMBG List - [EMAIL PROTECTED], www.tmbg.org




Re: ultradns reachability

2004-07-03 Thread Bill Woodcock

  On Fri, 2 Jul 2004, Stephen J. Wilcox wrote:
 10.1.0.1 Anycast1 (x50 boxes)
 10.2.0.1 Anycast2 (x50 boxes - different to anycast1)
 In each scenario two systems have to fail to take out any one customer.. but
 isn't the bottom one better for the usual pro-anycast reasons?

Correct, and that's what's done whenever engineering triumphs over
marketing.  The problem is that there's always a temptation to put
instances of both clouds at a single physical location, but that's
sabotaging yourself, since then the attack which takes down one will take
down the other as well.
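
The co-location point can be illustrated with a small sketch. It assumes a simplified model in which each client reaches exactly one site per anycast cloud; the client and site names are placeholders, not real deployments.

```python
def clients_cut_off(catchment, failed_sites):
    """Return clients for which every cloud's serving site has failed.

    catchment maps client -> tuple of sites, one per anycast cloud,
    i.e. where routing currently delivers that client's queries.
    """
    return sorted(
        client for client, sites in catchment.items()
        if all(site in failed_sites for site in sites)
    )

# Hypothetical catchments; site names are placeholders.
colocated = {"client-a": ("site-1", "site-1"), "client-b": ("site-2", "site-2")}
separated = {"client-a": ("site-1", "site-3"), "client-b": ("site-2", "site-4")}

attack = {"site-1"}  # one physical location is taken down
print(clients_cut_off(colocated, attack))  # ['client-a']
print(clients_cut_off(separated, attack))  # []
```

With both clouds at the same location, one site failure removes both nameserver addresses for the clients routed there; with separate locations the second cloud still answers.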

With DNS, it really makes sense to do what you're suggesting, since DNS
has its own internal load-balancing function, and having two separate
clouds just means that you're giving both the anycast and the DNS client
load-balancing algorithms a chance to work.  With pretty much any other
protocol (except peer-to-peer clients, which also mostly do client-side
load balancing) there's a big temptation to have a single huge cloud that
appears in as many places as possible.

-Bill




Re: ultradns reachability

2004-07-02 Thread Joe Abley

On 2 Jul 2004, at 00:18, Christopher L. Morrow wrote:
 So, I thought of it like this:
 1) Rodney/Centergate/UltraDNS knows where all their 35000billion copies of
 the 2 .org TLD boxes are, what network pieces they are connected to at
 which bandwidths and the current utilization
 2) Rodney/Centergate/UltraDNS knows which boxes in each location (there
 could be multiple inside each pod, right?) are running their dns process
 and answering at which rates
 3) Rodney/Centergate/UltraDNS knows when processes die and locally stop
 pushing requests to said system inside the pod
 4) Rodney/Centergate/UltraDNS knows when a pod is completely down (no
 systems responding inside the local pod) so they can stop routing the /24
 from that pod's location

 So, Rodney/Centergate/UltraDNS should know almost exactly when they have a
 problem they can term 'critical'... I most probably left out some steps
 above, like wedged processes or loss of outbound routing to prefixes
 sending requests. I'm sure Paul/ISC has a fairly complete list of failure
 modes for anycast DNS services.

All the failure modes that ISC has seen with anycast nameserver 
instances can be avoided (for the authoritative DNS service as a whole) 
by including one or more non-anycast nameservers in the NS set.

This leaves the anycast servers providing all the optimisation that 
they are good for (local nameserver in topologically distant networks; 
distributed DDoS traffic sink; reduced transaction RTT) and provides a 
fall-back in case of effective reachability problems for the anycast 
nameservers.

This is so trivial, I continue to be amazed that PIR hasn't done it.

 The problem then becomes the Hey, .org is dead! From where is it dead?
 What pod are you seeing it dead from? Is it routing TO the pod from you?
 FROM the pod to you? The pod itself? Stuck/stale routing information
 somewhere on the path(s)? This is very complex, or seems to be to me :(

With the fix above, the problem becomes hey, *some* of the nameservers 
for ORG are dead! We should fix that, but since not *all* of them are 
dead, at least ORG still works.
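
A minimal sketch of that reasoning, not any resolver's actual code: a zone stays resolvable from a given client as long as at least one server in its NS set answers. The server names and the `reachable` map are hypothetical.

```python
def zone_resolvable(ns_set, reachable):
    """Return True if at least one nameserver in the NS set answers."""
    return any(reachable.get(ns, False) for ns in ns_set)

# Hypothetical view from one client during an anycast routing problem:
# both anycast addresses are unreachable, the unicast server is not.
reachable = {
    "anycast1.example": False,
    "anycast2.example": False,
    "unicast1.example": True,
}

anycast_only = ["anycast1.example", "anycast2.example"]
with_fallback = anycast_only + ["unicast1.example"]

print(zone_resolvable(anycast_only, reachable))   # False
print(zone_resolvable(with_fallback, reachable))  # True
```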

 I think more failure modes will be investigated before that comes :)
 fortunately lots of people are already investigating these, eh?

I don't know about lots, but I know of a few. None of the people I know 
of are using an entire production TLD as their test-bed, however.

Joe


Re: ultradns reachability

2004-07-02 Thread Leo Bicknell
In a message written on Fri, Jul 02, 2004 at 10:22:09AM -0400, Joe Abley wrote:
 This leaves the anycast servers providing all the optimisation that 
 they are good for (local nameserver in topologically distant networks; 
 distributed DDoS traffic sink; reduced transaction RTT) and provides a 
 fall-back in case of effective reachability problems for the anycast 
 nameservers.
 
 This is so trivial, I continue to be amazed that PIR hasn't done it.

I talked to Rodney about this a long time ago, as well as a few
other people.  What in practice seems simple is complicated by some
of the software that is out there.  See:

http://www.nanog.org/mtg-0310/pdf/wessels.pdf

Note in the later pages what happens to particular servers under
packet loss.  They all start to show an affinity for a subset of
the servers.  It's been said that by putting some non-anycasted
servers in with the anycasted servers what can happen is if the
anycast has issues many things will latch on to the non-anycasted
servers and not go back even when the anycast is fixed.

How serious this is for something like .org I have no idea, but it's
clear all the software has issues, and until they are fixed I don't
think this is just a slam dunk.
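
The latching effect can be reproduced in a toy model. This is only a sketch under stated assumptions: a resolver that always picks the server with the lowest smoothed RTT and never re-probes or ages the others. It is not how any particular resolver is implemented, and the RTT values, timeout penalty, and server names are made up for illustration.

```python
TIMEOUT_MS = 2000  # hypothetical penalty recorded for a lost query

def pick(srtt):
    """Choose the server with the lowest smoothed RTT estimate."""
    return min(srtt, key=srtt.get)

def update(srtt, server, sample_ms, alpha=0.3):
    """Exponentially weighted moving average of the observed RTT."""
    srtt[server] = (1 - alpha) * srtt[server] + alpha * sample_ms

def simulate(true_rtt, down, start, end, queries):
    """Servers in `down` drop every query sent during [start, end)."""
    srtt = dict.fromkeys(true_rtt, 0.0)  # optimistic start: each server gets tried
    choices = []
    for i in range(queries):
        server = pick(srtt)
        lost = server in down and start <= i < end
        update(srtt, server, TIMEOUT_MS if lost else true_rtt[server])
        choices.append(server)
    return choices

choices = simulate({"anycast": 10, "unicast": 120}, {"anycast"}, 5, 10, 30)
print(choices[:6])   # mostly 'anycast' before the outage
print(choices[-5:])  # all 'unicast', long after the outage ended at query 10
```

In this model a single timeout inflates the anycast estimate so much that the slower unicast server keeps winning after the anycast path recovers. Implementations that age the estimates of unused servers would drift back; the toy deliberately omits that.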

-- 
   Leo Bicknell - [EMAIL PROTECTED] - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/
Read TMBG List - [EMAIL PROTECTED], www.tmbg.org




Re: ultradns reachability

2004-07-02 Thread Dr. Jeffrey Race

On Fri, 2 Jul 2004 10:22:09 -0400, Joe Abley wrote:
 With the fix above, the problem becomes hey, *some* of the nameservers 
 for ORG are dead! We should fix that, but since not *all* of them are 
 dead, at least ORG still works.

Sorry, I missed the top of this thread.  I cannot mail an ORG correspondent since
ORG domain lookups fail.  What is happening and is there a workaround?

Thanks for any help

Jeffrey Race





Re: ultradns reachability

2004-07-02 Thread Joe Abley

On 2 Jul 2004, at 10:43, Leo Bicknell wrote:
 Note in the later pages what happens to particular servers under
 packet loss.  They all start to show an affinity for a subset of
 the servers.  It's been said that by putting some non-anycasted
 servers in with the anycasted servers what can happen is if the
 anycast has issues many things will latch on to the non-anycasted
 servers and not go back even when the anycast is fixed.

In my opinion, the primary purpose of anycast distribution of 
nameservers is reliability of the service as a whole, and not 
performance. Being able to reach a server is much more important than 
whether you can get a reply from a particular server in 10ms or 500ms.

So, I think the issue you mention (which is certainly mention-worthy) 
is a much smaller problem than the apparently observed problem of all 
nameservers in the NS set being unavailable.

Joe


Re: ultradns reachability

2004-07-02 Thread Stephen J. Wilcox

On Fri, 2 Jul 2004, Joe Abley wrote:

 All the failure modes that ISC has seen with anycast nameserver 
 instances can be avoided (for the authoritative DNS service as a whole) 
 by including one or more non-anycast nameservers in the NS set.

Am I missing something..

So you say:

10.1.0.1 Anycast (x50 boxes)
10.2.0.1 Non-anycast

is somehow different from 

10.1.0.1 Anycast1 (x50 boxes)
10.2.0.1 Anycast2 (x50 boxes - different to anycast1)

In each scenario two systems have to fail to take out any one customer.. but 
isn't the bottom one better for the usual pro-anycast reasons?

Steve



Re: ultradns reachability

2004-07-02 Thread Matt Ghali

On Fri, 2 Jul 2004, Leo Bicknell wrote:

 So the question is not so much is 500ms towards the server
 bad, it's can I build a single server (cluster) that will take
 all the load worldwide when the client software does bad things.

DNS traffic, surprisingly, is not very fat. It is no HTTP nor SMTP.

The engineering behind appropriately sizing a unicast fallback would
be pretty trivial, especially compared to building a somewhat-robust
anycast architecture.
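
As a rough back-of-envelope sketch of the "not very fat" claim: the 40-byte query and 56-byte response sizes are the ones reported in a dig transcript elsewhere in this thread, while the query rate is purely hypothetical. It says nothing about DDoS load or misbehaving clients, which is the objection raised in reply.

```python
IP_UDP_OVERHEAD = 20 + 8  # IPv4 header without options plus UDP header, in bytes

def mbps(queries_per_second, dns_bytes):
    """Bandwidth in Mbit/s for a stream of UDP DNS messages of the given size."""
    return queries_per_second * (dns_bytes + IP_UDP_OVERHEAD) * 8 / 1e6

qps = 10_000  # hypothetical rate, not a measured figure for any TLD

print(mbps(qps, 40))  # inbound queries:    5.44
print(mbps(qps, 56))  # outbound responses: 6.72
```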

matto


ultradns reachability

2004-07-01 Thread Matt Ghali

is anyone else seeing timeouts reaching ultradns' .org nameservers?

I'm seeing seemingly random timeout failures from both sbci and uc berkeley.


RE: ultradns reachability

2004-07-01 Thread Cody Lerum

Well

http://www.dnsstuff.com is showing it also
(http://www.dnsstuff.com/tools/lookup.ch?name=mariners.org&type=A)

How I am searching:
Searching for A record for mariners.org at j.root-servers.net:  Got referral to TLD2.ULTRADNS.NET. [took 93 ms]
Searching for A record for mariners.org at TLD2.ULTRADNS.NET.:  Timed out.  Trying again.
Searching for A record for mariners.org at TLD2.ULTRADNS.NET.:  Timed out.  Trying again.
Searching for A record for mariners.org at TLD2.ULTRADNS.NET.:  Timed out.  Trying again.
Searching for A record for mariners.org at TLD1.ULTRADNS.NET.:  Timed out.  Trying again.
Searching for A record for mariners.org at TLD2.ULTRADNS.NET.:  Got referral to NS2.DIGISLE.NET. [took 473 ms]
Searching for A record for mariners.org at NS2.DIGISLE.NET.:  Reports mariners.org. [took 49 ms]

Answer:


Domain            Type  Class  TTL     Answer
mariners.org.     A     IN     3600    216.74.142.14
mariners.org.     NS    IN     3600    ns1.digisle.net.
mariners.org.     NS    IN     3600    ns2.digisle.net.
ns1.digisle.net.  A     IN     172800  167.216.250.42
ns2.digisle.net.  A     IN     172800  167.216.193.233


-C

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
Matt Ghali
Sent: Thursday, July 01, 2004 7:01 PM
To: [EMAIL PROTECTED]
Subject: ultradns reachability


is anyone else seeing timeouts reaching ultradns' .org nameservers?

I'm seeing seemingly random timeout failures from both sbci and uc
berkeley.




Re: ultradns reachability

2004-07-01 Thread Chris Adams

Once upon a time, Matt Ghali [EMAIL PROTECTED] said:
 is anyone else seeing timeouts reaching ultradns' .org nameservers?
 
 I'm seeing seemingly random timeout failures from both sbci and uc berkeley.

One is working and one is not from here.

$ dig +norec @tld1.ultradns.net whoareyou.ultradns.net in a

; <<>> DiG 8.4 <<>> +norec @tld1.ultradns.net whoareyou.ultradns.net in a 
; (1 server found)
;; res options: init defnam dnsrch
;; res_nsend: Connection timed out
$ dig +norec @tld2.ultradns.net whoareyou.ultradns.net in a

; <<>> DiG 8.4 <<>> +norec @tld2.ultradns.net whoareyou.ultradns.net in a 
; (1 server found)
;; res options: init defnam dnsrch
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33271
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUERY SECTION:
;;  whoareyou.ultradns.net, type = A, class = IN

;; ANSWER SECTION:
whoareyou.ultradns.net.  0S IN A  204.74.105.6

;; Total query time: 403 msec
;; FROM: ant.hiwaay.net to SERVER: 204.74.113.1
;; WHEN: Thu Jul  1 20:10:28 2004
;; MSG SIZE  sent: 40  rcvd: 56
$
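
The second transcript reports `MSG SIZE  sent: 40`. As a rough sketch (not how dig itself is implemented), the same non-recursive query can be built by hand from the RFC 1035 wire format, which makes that size easy to check. The `probe` helper and its timeout are illustrative assumptions and are not called here.

```python
import socket
import struct

def build_query(name, qtype=1, qclass=1, query_id=0, recursion_desired=False):
    """Build a minimal DNS query message (RFC 1035 wire format)."""
    flags = 0x0100 if recursion_desired else 0x0000  # RD bit; +norec leaves it clear
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, qclass)

def probe(server, packet, timeout=2.0):
    """Send the query over UDP; True if any reply arrives before the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server, 53))
            sock.recvfrom(512)
            return True
        except OSError:  # covers timeouts and resolution failures
            return False

packet = build_query("whoareyou.ultradns.net")  # type A (1), class IN (1)
print(len(packet))  # 40, matching the "sent" size in the transcript
```

Calling `probe("tld1.ultradns.net", packet)` and `probe("tld2.ultradns.net", packet)` would approximate the two dig runs, though the result depends on which anycast instance routing delivers the query to.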

-- 
Chris Adams [EMAIL PROTECTED]
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.


Re: ultradns reachability

2004-07-01 Thread Eric Frazier
Yes, it looks like it is starting to get back to normal since I got your 
email :)

As far as I could tell it started around 5:30 PST and ended around 6:00 PST.

Thanks,
Eric

At 06:01 PM 7/1/2004, Matt Ghali wrote:
 is anyone else seeing timeouts reaching ultradns' .org nameservers?
 I'm seeing seemingly random timeout failures from both sbci and uc berkeley.



Re: ultradns reachability

2004-07-01 Thread James Edwards
http://www.cymru.com/DNS/gtlddns-o.html




Re: ultradns reachability

2004-07-01 Thread Christopher L. Morrow


On Thu, 1 Jul 2004, James Edwards wrote:

 http://www.cymru.com/DNS/gtlddns-o.html


my mrtg skillz are kind of lame, but this seems to show 2/3rds outage from
this monitoring point of view. It'd be nice if the aforementioned
'what/where/who' info was available for each monitoring point CYMRU
uses... So you could tell that from the SBC POV you were querying the XO
westcoast pod, from the APPS POV you saw the Verio CHI pod and from the
ATT POV you saw the ATT local pod.

Anycast makes the pinpointing of problems a little challenging from the
external perspective it seems to me.


Re: ultradns reachability

2004-07-01 Thread k claffy

On Fri, Jul 02, 2004 at 02:06:59AM +0000, Christopher L. Morrow wrote:
  
  
  On Thu, 1 Jul 2004, James Edwards wrote:
  
   http://www.cymru.com/DNS/gtlddns-o.html
  
  
  my mrtg skillz are kind of lame, but this seems to show 2/3rds outage from
  this monitoring point of view. It'd be nice if the aforementioned
  'what/where/who' info was available for each monitoring point CYMRU
  uses... So you could tell that from the SBC POV you were querying the XO
  westcoast pod, from the APPS POV you saw the Verio CHI pod and from the
  ATT POV you saw the ATT local pod.
  
  Anycast makes the pinpointing of problems a little challenging from the
  external perspective it seems to me.

i am relieved it is only 'a little challenging' 
because i was worried it was 'sub-possible'.  
(or am i misinterpreting operational euphemisms...)

if we use the routing system to hide reality,
we give up transparency in exchange for vigor.  
it's unclear to me that we even know how to quantify 
much less measure that tradeoff.   like so many other
complexity tradeoffs.. 
but then we've taken similar risks before and gotten 
stuff like BGP so maybe we'll be, um, just as fond of 
anycast in due time. :)

k
//
they call this war a cloud over the land. but they made the 
weather and then they stand in the rain and say 'sh*t its raining!'.
-- renée zellweger, 'cold mountain' 
//


Re: ultradns reachability

2004-07-01 Thread Christopher L. Morrow


On Thu, 1 Jul 2004, k claffy wrote:

 On Fri, Jul 02, 2004 at 02:06:59AM +0000, Christopher L. Morrow wrote:
   On Thu, 1 Jul 2004, James Edwards wrote:
http://www.cymru.com/DNS/gtlddns-o.html
   
   Anycast makes the pinpointing of problems a little challenging from the
   external perspective it seems to me.

 i am relieved it is only 'a little challenging'
 because i was worried it was 'sub-possible'.
 (or am i misinterpreting operational euphemisms...)

Oops, I did it again, I forgot the :).

So, I thought of it like this:
1) Rodney/Centergate/UltraDNS knows where all their 35000billion copies of
the 2 .org TLD boxes are, what network pieces they are connected to at
which bandwidths and the current utilization
2) Rodney/Centergate/UltraDNS knows which boxes in each location (there
could be multiple inside each pod, right?) are running their dns process
and answering at which rates
3) Rodney/Centergate/UltraDNS knows when processes die and locally stop
pushing requests to said system inside the pod
4) Rodney/Centergate/UltraDNS knows when a pod is completely down (no
systems responding inside the local pod) so they can stop routing the /24
from that pod's location

So, Rodney/Centergate/UltraDNS should know almost exactly when they have a
problem they can term 'critical'... I most probably left out some steps
above, like wedged processes or loss of outbound routing to prefixes
sending requests. I'm sure Paul/ISC has a fairly complete list of failure
modes for anycast DNS services.
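
Steps 3 and 4 above can be sketched as a per-pod decision. This is a minimal illustration under assumed names; the `Server` fields and the idea of a local test query are hypothetical and do not describe UltraDNS's actual tooling.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    process_up: bool  # the DNS process is running
    answering: bool   # it replied to a local test query (False if wedged)

def healthy(server):
    """A server counts only if its process runs and it actually answers."""
    return server.process_up and server.answering

def pod_actions(servers):
    """Return (servers kept in local rotation, keep announcing the /24?)."""
    in_rotation = [s.name for s in servers if healthy(s)]
    announce = bool(in_rotation)  # withdraw the route once nothing answers
    return in_rotation, announce

pod = [
    Server("ns-a", process_up=True, answering=True),
    Server("ns-b", process_up=True, answering=False),  # wedged process
    Server("ns-c", process_up=False, answering=False),
]
print(pod_actions(pod))      # (['ns-a'], True)
print(pod_actions(pod[1:]))  # ([], False)
```

Note that this only covers failures visible from inside the pod. Loss of outbound routing toward the querying prefixes, mentioned above, would not be caught by a local check like this.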

The problem then becomes the Hey, .org is dead! From where is it dead?
What pod are you seeing it dead from? Is it routing TO the pod from you?
FROM the pod to you? The pod itself? Stuck/stale routing information
somewhere on the path(s)? This is very complex, or seems to be to me :(

A good thing, oddly enough, is each of these events gives everyone more
and better information about the failure modes :)


 but then we've taken similar risks before and gotten
 stuff like BGP so maybe we'll be, um, just as fond of
 anycast in due time. :)


I think more failure modes will be investigated before that comes :)
fortunately lots of people are already investigating these, eh?

-Chris


Re: ultradns reachability

2004-07-01 Thread Edward B. Dreger

CLM Date: Fri, 02 Jul 2004 04:18:07 +0000 (GMT)
CLM From: Christopher L. Morrow

[ edited for brevity -- some punctuation/wording modified ]


CLM So, I thought of it like this.  Rodney/Centergate/UltraDNS
CLM knows:

[ snip enumeration ]


CLM [and] should know almost exactly when they have a problem
CLM they can term 'critical'...

One essentially has a DNS network on top of IP network.  Looks
like O(N) with centralized monitoring, although it could approach
O(N^2) if each server/pod cross-monitored all the others.
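
To make the scaling concrete, here is a small sketch that counts health checks per polling round under the two arrangements. The pod counts are arbitrary examples.

```python
def centralized_checks(n):
    """One monitor probing each of n pods: n checks per round."""
    return n

def full_mesh_checks(n):
    """Every pod probing every other pod: n * (n - 1) directed checks."""
    return n * (n - 1)

for n in (10, 50, 100):
    print(n, centralized_checks(n), full_mesh_checks(n))
# 10 10 90
# 50 50 2450
# 100 100 9900
```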


CLM The problem then becomes the Hey, .org is dead! From where
CLM is it dead?  What pod are you seeing it dead from? Is it
CLM routing TO the pod from you?  FROM the pod to you? The pod
CLM itself? Stuck/stale routing information somewhere on the
CLM path(s)? This is very complex, or seems to be to me :(

I find your perception of complexity ironic.  Yes, there's a good
deal of splay.  However, I suspect a network the size of UU also
has a fair amount of peering splay, with a couple downstreams
thrown in for good measure. ;-)

However, I agree anycast has additional design implications:

* Should servers/pods talk among themselves using mcast along
  pairs that follow L3 topology?  Should N servers/pods each
  communicate with (N / 2 + 1) others, ignoring L3 topology?
  Fast poll the former and slow poll the latter?

* If servers/pods communicate among themselves, should they use
  unicast addresses? anycast addresses? anycast addresses
  tunneled through unicast?

* Each pod a stub?  Each pod interconnected with an OOB OAM
  network?  All pods interconnected with sizable backbone?  Does
  multicast serve a purpose?


Eddy
--
EverQuick Internet - http://www.everquick.net/
A division of Brotsman & Dreger, Inc. - http://www.brotsman.com/
Bandwidth, consulting, e-commerce, hosting, and network building
Phone: +1 785 865 5885 Lawrence and [inter]national
Phone: +1 316 794 8922 Wichita