Re: anycast (Re: .ORG problems this evening)

2003-09-22 Thread David G. Andersen

On Thu, Sep 18, 2003 at 02:38:18PM -0400, Todd Vierling quacked:
 
 On Thu, 18 Sep 2003, E.B. Dreger wrote:
 
 : EBD That's why one uses a daemon with main loop including
 : EBD something like:
 : EBD
 : EBD     success = 1 ;
 : EBD     for ( i = checklist ; i->callback != NULL ; i++ )
 : EBD         success &= i->callback(foo) ;
 : EBD     if ( success )
 : EBD         send_keepalive(via_some_ipc_mechanism) ;
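
A minimal C sketch of the quoted main loop, filled out so that it compiles. The struct layout, the callback signature, and the two stand-in checks are assumptions made for illustration; the quoted fragment only fixes the NULL-terminated list and the "all checks must pass" rule.

```c
#include <assert.h>
#include <stddef.h>

/* One health check; returns nonzero on success. (Assumed signature.) */
typedef int (*check_fn)(void *ctx);

struct check {
    check_fn callback;   /* NULL marks the end of the list */
};

/* Run every check; the result is 1 only if all of them pass. */
int run_checks(const struct check *checklist, void *ctx)
{
    const struct check *i;
    int success = 1;

    assert(checklist != NULL);
    for (i = checklist; i->callback != NULL; i++)
        success &= (i->callback(ctx) != 0);
    return success;
}

/* Hypothetical stand-in checks, used only for illustration. */
int check_always_ok(void *ctx)   { (void)ctx; return 1; }
int check_always_fail(void *ctx) { (void)ctx; return 0; }
```

In a daemon, the keepalive would be sent only when run_checks() returns 1. How the routing side reacts to a missing keepalive is outside this sketch.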
 
 Yes, I hope that UltraDNS implements something like this, if they have not
 already.  It's still not a guarantee that things will get withdrawn -- or be
 reachable, even if working but not withdrawn -- in case of a problem.  That
 still leaves the DNS for a gTLD at risk for a single point of failure.

The whole problem with only listing two anycast servers is that 
you leave yourself vulnerable to other kinds of faults.  Your
upstream ISP fat-fingers ip route 64.94.110.11 null0 and
accidentally blitzes the netblock from which the anycast servers
are announced.  A router somewhere between customers and the
anycast servers stops forwarding traffic, or starts corrupting
transit data, without interrupting its route processing.
packet filters get misconfigured..

(Observe how divorced route processing and packet processing
are in modern routing architectures and it's pretty easy to
see how this can happen.  With load balancing, traffic
can get routed down a non-functional path while routing
takes place over the other one - BBN did that to us once,
was very entertaining).

Route updates in BGP take a while to propagate.  Much longer
than the 15ms RTT from me to, say, a.root-servers.net.  The application
retry in this context can be massively faster than waiting 30+ seconds
for a BGP update interval.

The availability of the DNS is now co-mingled with the success
of the magic route tweak code;  the resulting system is a fair
bit more complex than simply running a bunch of different
DNS servers.   God forbid that zebra ever has bugs...

  http://www.geocrawler.com/lists/3/GNU/372/0/

In contrast, talking to a few DNS servers gives you an end-to-end
test of how well the service is working.  You still depend on the
answers being correct, but you can intuit a lot from whether
or not you actually get answers, instead of sitting around twiddling
your thumbs thinking, gee, I sure wish that routing update would
get sent out so I could use the 'net.
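
The end-to-end behaviour described above can be sketched as a resolver that walks a list of name servers and takes the first one that answers. This is only an illustration: probe_fn, the stand-in probes, and the server names are hypothetical, and a real probe would send a DNS query and wait for the timeout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical probe: returns 1 if the server answered within the timeout. */
typedef int (*probe_fn)(const char *server, int timeout_ms);

/* Return the index of the first server that answers, or -1 if none does. */
int first_responsive(const char *const *servers, size_t n,
                     probe_fn probe, int timeout_ms)
{
    size_t i;

    assert(probe != NULL);
    for (i = 0; i < n; i++)
        if (probe(servers[i], timeout_ms))
            return (int)i;
    return -1;
}

/* Stand-in probes for illustration only. */
int probe_only_b(const char *server, int timeout_ms)
{
    (void)timeout_ms;
    return server[0] == 'b';   /* pretend only "b..." is reachable */
}

int probe_none(const char *server, int timeout_ms)
{
    (void)server;
    (void)timeout_ms;
    return 0;                  /* pretend nothing answers */
}
```

Each failed probe costs one timeout, so failover time is bounded by the client's own retry settings rather than by route propagation.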

  -Dave

-- 
work: [EMAIL PROTECTED]  me:  [EMAIL PROTECTED]
  MIT Laboratory for Computer Science   http://www.angio.net/
  I do not accept unsolicited commercial email.  Do not spam me.


Re: anycast (Re: .ORG problems this evening)

2003-09-22 Thread Patrick

On Mon, 22 Sep 2003, David G. Andersen wrote:

  Yes, I hope that UltraDNS implements something like this, if they have not
  already.  It's still not a guarantee that things will get withdrawn -- or be
  reachable, even if working but not withdrawn -- in case of a problem.  That
  still leaves the DNS for a gTLD at risk for a single point of failure.

 The whole problem with only listing two anycast servers is that
 you leave yourself vulnerable to other kinds of faults.  Your
 upstream ISP fat-fingers ip route 64.94.110.11 null0 and
 accidentally blitzes the netblock from which the anycast servers
 are announced.  A router somewhere between customers and the
 anycast servers stops forwarding traffic, or starts corrupting
 transit data, without interrupting its route processing.
 packet filters get misconfigured..

That's a good reason to make sure that you are anycasting from at least
two disparate netblocks, isn't it? :-)


/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
   Patrick Greenwell
 Asking the wrong questions is the leading cause of wrong answers
\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/


Re: anycast (Re: .ORG problems this evening)

2003-09-22 Thread E.B. Dreger

DGA Date: Mon, 22 Sep 2003 18:32:19 -0400
DGA From: David G. Andersen


DGA The whole problem with only listing two anycast servers is that
DGA you leave yourself vulnerable to other kinds of faults.  Your
DGA upstream ISP fat-fingers ip route 64.94.110.11 null0 and
DGA accidentally blitzes the netblock from which the anycast servers
DGA are announced.  A router somewhere between customers and the

And this is peculiar to anycast?


DGA anycast servers stops forwarding traffic, or starts corrupting

And this is peculiar to anycast?


DGA transit data, without interrupting its route processing.
DGA packet filters get misconfigured..

And this is peculiar to anycast?


DGA Route updates in BGP take a while to propagate.  Much longer
DGA than the 15ms RTT from me to, say, a.root-servers.net.  The application
DGA retry in this context can be massively faster than waiting 30+ seconds
DGA for a BGP update interval.

If a location goes dark, that's a problem.  With redundant
machines locally anycasted and inter-location transport, it
becomes a question of border router and peer reliability.


DGA The availability of the DNS is now co-mingled with the success
DGA of the magic route tweak code;  the resulting system is a fair

The availability of * is co-mingled with the success of the gear
advertising its prefixes.

The difference between standard multihoming and anycast is that
the behind-the-scenes stuff happens to be on different machines
in different locations.


DGA bit more complex than simply running a bunch of different
DGA DNS servers.   God forbid that zebra ever has bugs...
DGA
DGA   http://www.geocrawler.com/lists/3/GNU/372/0/

You assume zebra is the only option.  Sure, it has bugs.  So do
Vendors C, J, and R.


DGA In contrast, talking to a few DNS servers gives you an end-to-end
DGA test of how well the service is working.

So splay is bad?


Eddy
--
Brotsman & Dreger, Inc. - EverQuick Internet Division
Bandwidth, consulting, e-commerce, hosting, and network building
Phone: +1 785 865 5885 Lawrence and [inter]national
Phone: +1 316 794 8922 Wichita
_
  DO NOT send mail to the following addresses :
  [EMAIL PROTECTED] -or- [EMAIL PROTECTED] -or- [EMAIL PROTECTED]
Sending mail to spambait addresses is a great way to get blocked.



Re: anycast (Re: .ORG problems this evening)

2003-09-22 Thread just me

On Mon, 22 Sep 2003, David G. Andersen wrote:

  With load balancing, traffic can get routed down a non-functional
  path while routing takes place over the other one - BBN did that
  to us once, was very entertaining).

Ah yes, I'll always have a special place in my heart for those
Localdirectors. *cough*


  In contrast, talking to a few DNS servers gives you an end-to-end
  test of how well the service is working.  You still depend on the
  answers being correct, but you can intuit a lot from whether
  or not you actually get answers, instead of sitting around twiddling
  your thumbs thinking, gee, I sure wish that routing update would
  get sent out so I could use the 'net.

Anycast isn't the only thing possibly stuck waiting for routing
convergence... Let's not get carried away here.

matto

[EMAIL PROTECTED]darwin
   Flowers on the razor wire/I know you're here/We are few/And far
   between/I was thinking about her skin/Love is a many splintered
   thing/Don't be afraid now/Just walk on in. #include disclaim.h



Re: .ORG problems this evening

2003-09-19 Thread Alex Bligh


--On 18 September 2003 10:05 -0400 Todd Vierling [EMAIL PROTECTED] wrote:

 DNS site A goes down, but its BGP advertisements are still in effect.
 (Their firewall still appears to be up, but DNS requests fail.)  Host
 site C cannot resolve ANYTHING from DNS site A, even though DNS site B is
 still up and running.  But host site C cannot see DNS site B!

What you seem to be missing is that the BGP advert goes away when the DNS
requests stop working.

I have written DNS/BGP code (nothing to do with UltraDNS) and I can tell
you it works very well. Even if you unplug the machine from the net you can
get rapid failover by tweaking a BGP timer here or there. If you are going
to say "yes, but that means I don't have one of the servers up whilst
routing reconverges", this is true, but (a) it happens ANYWAY, and (b) as the
preferred route is in general more local, the rainshadow from routing
reconvergence in the event of disruption is smaller.

Alex


apathy (was Re: .ORG problems this evening)

2003-09-19 Thread Todd Vierling

On Fri, 19 Sep 2003, Alex Bligh wrote:

:  DNS site A goes down, but its BGP advertisements are still in effect.
:  (Their firewall still appears to be up, but DNS requests fail.)  Host
:  site C cannot resolve ANYTHING from DNS site A, even though DNS site B is
:  still up and running.  But host site C cannot see DNS site B!
:
: What you seem to be missing is that the BGP advert goes away when the DNS
: requests stop working.

It didn't.  That's the problem.

I've repeatedly described how I do understand the methodology here.  What's
being expressed on this list is blind faith and trust in an anycast-only
gTLD DNS scheme that has the possibility of routing to a single point of
failure.

This scheme has already failed once.  (When will it fail again?)

Established gTLD practice has not put trust in an anycast routing scheme
where one (1) destination might serve all queries for a host.  What I've
tried to express is that the years-established, standard DNS redundancy
failover model could and should be implemented to complement -- not replace
-- this anycast model for something as critical as a Big Three gTLD.

That's fine; I give up due to pervasive community apathy.  When this happens
again, I'll be sure to bring up the archive URL to the head of this thread.

sigh

-- 
-- Todd Vierling [EMAIL PROTECTED] [EMAIL PROTECTED]


RE: apathy (was Re: .ORG problems this evening)

2003-09-19 Thread Eric Germann



 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of
 Todd Vierling
 Sent: Friday, September 19, 2003 11:37 AM
 To: [EMAIL PROTECTED]
 Subject: apathy (was Re: .ORG problems this evening)
 
 
 I've repeatedly described how I do understand the methodology 
 here.  What's
 being expressed on this list is blind faith and trust in an anycast-only
 gTLD DNS scheme that has the possibility of routing to a single point of
 failure.
 

Anyone know if 64.94.110.11 is done via anycast?

 This scheme has already failed once.  (When will it fail again?)


In that case, hopefully soon ...
 




Re: apathy (was Re: .ORG problems this evening)

2003-09-19 Thread Rodney Joffe



Todd Vierling wrote:
 
 On Fri, 19 Sep 2003, Alex Bligh wrote:
 
 :  DNS site A goes down, but its BGP advertisements are still in effect.
 :  (Their firewall still appears to be up, but DNS requests fail.)  Host
 :  site C cannot resolve ANYTHING from DNS site A, even though DNS site B is
 :  still up and running.  But host site C cannot see DNS site B!
 :
 : What you seem to be missing is that the BGP advert goes away when the DNS
 : requests stop working.
 
 It didn't.  That's the problem.
 
 I've repeatedly described how I do understand the methodology here.  What's
 being expressed on this list is blind faith and trust in an anycast-only
 gTLD DNS scheme that has the possibility of routing to a single point of
 failure.
 
 This scheme has already failed once.  (When will it fail again?)
 
 Established gTLD practice has not put trust in an anycast routing scheme
 where one (1) destination might serve all queries for a host.  What I've
 tried to express is that the years-established, standard DNS redundancy
 failover model could and should be implemented to complement -- not replace
 -- this anycast model for something as critical as a Big Three gTLD.
 
 That's fine; I give up due to pervasive community apathy.  When this happens
 again, I'll be sure to bring up the archive URL to the head of this thread.
 
 sigh

You started from a point of having no idea that UltraDNS used anycast,
confirmed for everyone in your second email that you had no clue about
how anycast worked, and migrated by your third email to being an expert
on how it should work. And based on assumptions that were flawed in the
very beginning, you've created a one megabyte thread and an (s+n)/n ratio
almost unparalleled by anything I've ever seen on NANOG before. As I
told you privately, I'm working on a response that tries to deal with
all the misinformation you've spouted. There is so much, however, that
it is taking more than the 10 minutes you took to decide you knew it
all.

So you can call it apathy, or anything else you want. It seems
consistent with your way of jumping to conclusions based on flawed
assumptions. But it's really just that other people actually take time
to research issues before mouthing off.

YMMV, and apparently it does.

In the interim, feel free to post your operational experience and
qualification with tlds and their dns.
-- 
Rodney Joffe
CenterGate Research Group, LLC.
http://www.centergate.com
Technology so advanced, even we don't understand it!(R)


Re: apathy (was Re: .ORG problems this evening)

2003-09-19 Thread Todd Vierling

On Fri, 19 Sep 2003, Rodney Joffe wrote:

: You started from a point of having no idea that UltraDNS used anycast,
: confirmed for everyone in your second email that you had no clue about
: how anycast worked,

Please stop the bellicose, holier-than-thou attitude because you feel like
assuming that I don't have networking experience.  It's getting tiresome.
I apologize for whatever I've done to offend you.

What I didn't know at first was that UltraDNS's system was based on anycast.
Yes, it was my oversight, probably due to my own complacency with the gTLDs
Just Working for so long.  Once I was notified of that fact, my perspective
on the problem changed quite a bit.

I do know how anycast routing works, and that it failed miserably in this
particular case.  The implementation failure specifics are not my concern on
this point; the simple fact is that a critical gTLD resource failed.

Blindly trusting that the all-anycast implementation in use will work
better in the future seemed a rather bad idea to me in the context of a
gTLD.  I was trying to figure out, with the help of others who have been far
more gracious, what possibilities exist that could help keep the failure
from happening again -- outside the scope of this particular anycast
implementation.

: But it's really just that other people actually take time to research
: issues before mouthing off.

Actually, my first few requests for corroborating information (research)
received several mouthing-off responses.  Much of this thread has required
me to fend off rather improper personal attacks -- this one included -- from
people such as yourself, while at the same time attempting to get assistance
to analyze a difficult-to-see corner-case problem with a critical resource.

I have apologized offlist to a few people whose heated remarks to me
received heated messages in response, and I apologize to all on-list right
now.  That is not appropriate here in either direction.

: In the interim, feel free to post your operational experience

Ultimatum demands like this are just not called for, and I will not be a
party to it.  However, I'm happy to discuss it offlist with anyone who may
be interested; there are business-vs.-personal reasons that I cannot discuss
this on-list.

-- 
-- Todd Vierling [EMAIL PROTECTED] [EMAIL PROTECTED]


Re: apathy (was Re: .ORG problems this evening)

2003-09-19 Thread Richard A Steenbergen

On Fri, Sep 19, 2003 at 01:36:41PM -0400, Todd Vierling wrote:
 
 On Fri, 19 Sep 2003, Rodney Joffe wrote:
 
 : You started from a point of having no idea that UltraDNS used anycast,
 : confirmed for everyone in your second email that you had no clue about
 : how anycast worked,
 
 Please stop the bellicose, holier-than-thou attitude because you feel like
 assuming that I don't have networking experience.  It's getting tiresome.
 I apologize for whatever I've done to offend you.

On behalf of the entire NANOG community, please stop pretending that you
DO have a clue just because you believe people shouldn't assume you don't.
Trust me when I say that it is no longer an assumption.

Please also do not mistake apathy for annoyance at your continued 
incessant whining about anycast and UltraDNS. You don't have anything more 
useful to say, so please do us all a favor and just stop now.

-- 
Richard A Steenbergen [EMAIL PROTECTED]   http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)


Re: .ORG problems this evening

2003-09-18 Thread Todd Vierling

On Thu, 18 Sep 2003, Jared Mauch wrote:

:   ultradns uses the power of anycast to have these ips that appear
: to be on close subnets in geographically diverse locations.

Oh, that's brilliant.  How nice of them to defeat the concept of redundancy
by limiting me to only two of their servers for a gTLD.

VeriSign might be doing some loathsome things lately, but at least my named
has several more servers than just two to choose from.

:   could you provide some more technical details, other than
: your postulations that they have two machines on
: network-wise close subnets and that is the problem?

I tracerouted to both IPs from two different locations in the USA; both took
the same route before hitting !H from an ultradns.com rDNS machine.  And
both servers for that route were completely unresponsive from both tried
locations during the outage period.

-- 
-- Todd Vierling [EMAIL PROTECTED] [EMAIL PROTECTED]


Re: .ORG problems this evening

2003-09-18 Thread Todd Vierling

On Thu, 18 Sep 2003, Majdi S. Abbas wrote:

:   I didn't have a problem with .org this evening, and I've asked
: around and others don't seem to have noticed anything either.  It would be
: more helpful if you told us your source prefix, and which filter you're
: hitting when you traceroute to tld[12].ultradns.net.

12  dellfweqab.ultradns.net (204.74.103.2)  24.811 ms !H

Same machine for both tld1 and tld2, seen through XO last night and Verio
this morning, from source prefix 66.56.64.0/19 (as well as two others, one
on the US east coast and one in US midwest which I cannot name publicly).

So as far as my machine's source address is concerned, even if the servers
are anycast, there are still only two servers which reside on a single point
of failure.  Anycasting doesn't help me one whit if there are only two
servers for my named to choose and both of the ones visible from my location
are down (even though their routes are up) -- this is IMNSHO irresponsible
for a gTLD operator.

If anycast is the game, there should be many more than just two addresses to
choose from.  Ideally, there should be about six, and certain servers should
deliberately *not* advertise certain anycast networks, in an overlapping mesh
that allows one point to fail while others still respond.  For instance:

USA server location A advertises networks 1, 3, 5;
USA server location B advertises networks 1, 3, 4;
Europe server location A advertises networks 3, 4, 6;
Asia server location A advertises networks 2, 5, 6;

or something to that effect.
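
The property this example mesh is meant to provide can be checked mechanically. The sketch below is a simplified model rather than a statement about any real deployment: it assumes a client can only be routed to a location that advertises the network in question, and that a failed location keeps advertising while no longer answering. The bitmask encoding and the function name are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Bit k-1 set means the location advertises anycast network k. */
#define NET(k) (1u << ((k) - 1))

/*
 * Return 1 if, whichever single location fails, some network is still
 * advertised elsewhere and NOT by the failed location, so that address
 * cannot be routed to the dead site. Return 0 otherwise.
 */
int survives_single_failure(const unsigned *adverts, size_t n)
{
    size_t failed, other;

    assert(adverts != NULL);
    for (failed = 0; failed < n; failed++) {
        unsigned elsewhere = 0;

        for (other = 0; other < n; other++)
            if (other != failed)
                elsewhere |= adverts[other];
        if ((elsewhere & ~adverts[failed]) == 0)
            return 0;
    }
    return 1;
}
```

With the four locations listed above the check passes. With every location advertising the same two networks it fails, because a client may pick the same dead location for both.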

-- 
-- Todd Vierling [EMAIL PROTECTED] [EMAIL PROTECTED]


Re: .ORG problems this evening

2003-09-18 Thread Todd Vierling

On Thu, 18 Sep 2003, Stephen J. Wilcox wrote:

: they have two distinct servers by IP, globally they have N x clusters. i'm sure
: each instance is actually more than a single linux PeeCee

Doesn't matter if it's a cluster at each location.  The fact remains that
there were only two IP addresses visible to my named, and both were
unresponsive to my machine.  As far as my machine was concerned, .ORG was
down for the count, no matter how many servers, that were invisible to me,
were still working.

: so even if what i see as tld1 now goes into failure.. for the minute or two it
: takes to go offline and reconverge on another tld1 i still see tld2

The routes I saw never went offline, as far as I could tell -- and from my
location tld1 and tld2 have the *same* route and end up at the same physical
connectivity location.  So much for redundancy.

: maybe its firewalled? I see !H too but my .org is working fine for dns resolving

Yes, it is firewalled.  I was pointing out that the route is the same for
tld1 and tld2 for me, all the way up to the firewall.

-- 
-- Todd Vierling [EMAIL PROTECTED] [EMAIL PROTECTED]


Re: .ORG problems this evening

2003-09-18 Thread Rodney Joffe



Todd Vierling wrote:
 
 Yes, it is firewalled.  I was pointing out that the route is the same for
 tld1 and tld2 for me, all the way up to the firewall.

Please post traceroutes from your location, as well as from the two
locations in different parts of the USA.  (You said earlier: "I
tracerouted to both IPs from two different locations in the USA; both
took the same route before hitting !H from an ultradns.com rDNS machine.")

Then please post the results of "sho ip bgp 204.74.112.1" and "sho ip bgp
204.74.113.1" from your location.

Thanks
-- 
Rodney Joffe
CenterGate Research Group, LLC.
http://www.centergate.com
Technology so advanced, even we don't understand it!(SM)


Re: .ORG problems this evening

2003-09-18 Thread Todd Vierling

On Thu, 18 Sep 2003, just me wrote:

: If you're still confused, have a read here:
:
: http://www.ultradns.com/support/managed_dns_faq.cfm
:
: Q. I read that your service is supposed to make use of several
: servers all over the world, but you only give users two server
: addresses to provide to their registrar. How do I make use of all the
: other servers?

I know what anycast does.  See the other sister thread.

The problem is that their answer is frankly *wrong*:

  A.  The two server addresses you supply your registrar when you set up a
  domain on the UltraDNS system are actually 'virtual' addresses that will
  route to the best possible server on our network, based on a number of
  factors. This highly intelligent mechanism allows you to achieve full
  redundancy and reliability with only two name server addresses actually
  listed. In fact, if the registrar would allow you to do so, you could
  achieve the same level of reliability with only one name server address.

Anycast is *NOT* a redundancy and reliability system when dealing with
application-based services like DNS.  Rather, anycast is a geographically
biased traffic distribution system.  There is a subtle but important
difference here:

DNS site A advertises anycast networks 1.2.3.0/24 and 1.2.4.0/24.
DNS site B advertises anycast networks 1.2.3.0/24 and 1.2.4.0/24.

Host site C attempts to use DNS servers from DNS sites A or B based on best
anycast route selection.  Host site C's router happens to pick DNS site A as
best route for both 1.2.3.0/24 and 1.2.4.0/24.

DNS site A goes down, but its BGP advertisements are still in effect.
(Their firewall still appears to be up, but DNS requests fail.)  Host site C
cannot resolve ANYTHING from DNS site A, even though DNS site B is still up
and running.  But host site C cannot see DNS site B!
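
The scenario above can be written down as a tiny model. It is a sketch under the scenario's own assumption that the dead site keeps its routes up; the array layout and the names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

enum { SITE_A = 0, SITE_B = 1 };

/*
 * best_site[p] is the site the client's router currently prefers for
 * anycast prefix p; site_answers[s] is 1 if DNS at site s still answers.
 * Returns 1 if at least one prefix leads to a site that answers.
 */
int can_resolve(const int *best_site, size_t n_prefixes,
                const int *site_answers)
{
    size_t p;

    assert(best_site != NULL && site_answers != NULL);
    for (p = 0; p < n_prefixes; p++)
        if (site_answers[best_site[p]])
            return 1;
    return 0;
}
```

If both prefixes resolve to site A and site A stops answering, can_resolve() returns 0 even though site B is healthy. If a withdrawal moved either prefix to site B, it would return 1.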

Get the picture yet?

-- 
-- Todd Vierling [EMAIL PROTECTED] [EMAIL PROTECTED]


Re: .ORG problems this evening

2003-09-18 Thread Leo Bicknell

In a message written on Thu, Sep 18, 2003 at 10:05:15AM -0400, Todd Vierling wrote:
 Anycast is *NOT* a redundancy and reliability system when dealing with
 application-based services like DNS.  Rather, anycast is a geographically

I think you'll find most people on the list would disagree with you
on this point.  Many ISP's run anycast for customer facing DNS
servers, and I'll bet if you ask the first reason why isn't because
they provide faster service, or distribute load, but because the
average customer only wants one or two IP's to put in his DNS config,
and gets real annoyed when they don't work.  So it is a redundancy
and reliability thing, the customer can configure (potentially) one
address, and the ISP can have 10 servers for it so if one dies all
is well.

Is it appropriate for a gTLD?  Now that's a whole different can of
worms.  Personally I think they should return the two anycast
addresses, and as many actual server addresses as will fit in the
packet.  This is the best of both worlds.  When it works, geographically
distributed load, redundancy at the IP layer, quick responses.  When
one of the failure modes is encountered (eg, stuck route) DNS has
the information it needs to switch to a backup as well.

Redundancy is good.  Redundancy at two levels is even better,
particularly when they can back each other up.  Plus, in this case it
costs them nothing, they just have to tweak a config.

-- 
   Leo Bicknell - [EMAIL PROTECTED] - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/
Read TMBG List - [EMAIL PROTECTED], www.tmbg.org




Re: .ORG problems this evening

2003-09-18 Thread Todd Vierling

On Thu, 18 Sep 2003, Leo Bicknell wrote:

:  Anycast is *NOT* a redundancy and reliability system when dealing with
:  application-based services like DNS.  Rather, anycast is a geographically
:
: I think you'll find most people on the list would disagree with you
: on this point.  Many ISP's run anycast for customer facing DNS
: servers, and I'll bet if you ask the first reason why isn't because
: they provide faster service, or distribute load, but because the
: average customer only wants one or two IP's to put in his DNS config,
: and gets real annoyed when they don't work.

And guess what:  neither of the two addresses supplied by UltraDNS worked
last night for some sites, because their anycast configuration is not
allowing DNS redundancy.  It is depending on every site somehow choosing
different routes for both addresses, which is not guaranteed.

Anycasting only works as a redundancy scheme when you have a mesh of
*partially* overlapping BGP advertisements, so that a client has a guarantee
that at least one address in the mix is located elsewhere from the rest.

: So it is a redundancy and reliability thing, the customer can configure
: (potentially) one address, and the ISP can have 10 servers for it so if
: one dies all is well.

But if all such anycast addresses have the ability to point to the same
physical location, there is only an illusion of redundancy, because there's
no way to get an alternate access point to the zone if a site is choosing a
dead route for all server addresses.  It doesn't matter how many other
servers at the DNS provider are still working, because some sites can choose
-- and have demonstrably chosen -- a single, dead site for all available
anycast NS addresses in a setup like this (UltraDNS's .ORG configuration).

: Is it appropriate for a gTLD?

UltraDNS's setup isn't even appropriate for a 2LD.  I'm damned glad that I
don't have my subdomains hosted there.

-- 
-- Todd Vierling [EMAIL PROTECTED] [EMAIL PROTECTED]


Re: .ORG problems this evening

2003-09-18 Thread David Lesher

Speaking on Deep Background, the Press Secretary whispered:
 
 : I think you'll find most people on the list would disagree with you
 : on this point.  Many ISP's run anycast for customer facing DNS
 : servers, and I'll bet if you ask the first reason why isn't because
 : they provide faster service, or distribute load, but because the
 : average customer only wants one or two IP's to put in his DNS config,
 : and gets real annoyed when they don't work.

And/or, the networking stack may accept 3,4{...}50 DNS addresses,
but only really looks at the first.




-- 
A host is a host from coast to [EMAIL PROTECTED]
 no one will talk to a host that's close[v].(301) 56-LINUX
Unless the host (that isn't close).pob 1433
is busy, hung or dead20915-1433


Re: .ORG problems this evening

2003-09-18 Thread E.B. Dreger

TV Date: Thu, 18 Sep 2003 10:05:15 -0400 (EDT)
TV From: Todd Vierling


TV DNS site A goes down, but its BGP advertisements are still in
TV effect.

Or are they?


Eddy



Re: .ORG problems this evening

2003-09-18 Thread E.B. Dreger

TV Date: Thu, 18 Sep 2003 11:39:17 -0400 (EDT)
TV From: Todd Vierling


TV And guess what:  neither of the two addresses supplied by
TV UltraDNS worked last night for some sites, because their
TV anycast configuration is not allowing DNS redundancy.  It is
TV depending on every site somehow choosing different routes for
TV both addresses, which is not guaranteed.

I don't know what UDNS does internally, but ideally anycast:

+ Has steady, unchanging EGP adverts
+ Has service-providing boxen that advert/withdraw prefixes in
  the IGP depending on their status
+ Includes an internal network, so that flaps are contained.

If done properly, anycast means _all_ pods must fail to create a
failure condition.  If done improperly, it means _any_ pod
failure can create a partial failure condition -- which means the
probability of failure _increases_ with the number of pods.
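
To put rough numbers on this, suppose there are n pods and each fails independently with probability p. Both the independence and the single shared p are simplifying assumptions, not something stated in the thread.

```latex
% Assumption: n pods, each failing independently with probability p.
\begin{align*}
P_{\text{proper}}   &= p^{n}
  && \text{(outage only if all pods fail)} \\
P_{\text{improper}} &= 1 - (1 - p)^{n}
  && \text{(some clients affected if any pod fails)}
\end{align*}
% Example with p = 0.01 and n = 10:
%   P_proper   = 10^{-20}
%   P_improper = 1 - 0.99^{10} \approx 0.096
```

The first expression shrinks as n grows and the second grows with n, which is the contrast drawn above.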


TV Anycasting only works as a redundancy scheme when you have a
TV mesh of *partially* overlapping BGP advertisements, so that a
TV client has a guarantee that at least one address in the mix
TV is located elsewhere from the rest.

Don't be silly.  This is like claiming that multihoming only
works if you spread services over different netblocks.


TV But if all such anycast addresses have the ability to point
TV to the same physical location, there is only an illusion of
TV redundancy, because there's no way to get an alternate access
TV point to the zone if a site is choosing a dead route for all
TV server addresses.  It doesn't matter how many other servers

Ergo, that's why one withdraws the routes when a pod dies.
Routes need to reflect what's up.  Funny thing is, standard BGP
has the same requirement.

You're correct that an incorrect anycast setup can cause trouble,
and arguably more than unicast.  However, claiming that anycast
is inherently bad is really, really silly.


Eddy (no selfish interest in defending UltraDNS)



Re: .ORG problems this evening

2003-09-18 Thread Todd Vierling

On Thu, 18 Sep 2003, E.B. Dreger wrote:

: TV Date: Thu, 18 Sep 2003 10:05:15 -0400 (EDT)
: TV From: Todd Vierling
:
: TV DNS site A goes down, but its BGP advertisements are still in
: TV effect.
:
: Or are they?

I couldn't know for sure from some sites, but traceroutes sure got there.
That would imply that (at their end) the advertisements were still up.

BGP has no way to know that an internal network problem occurred.  If
someone mistakenly tripped over a network cable that disconnected DNS
clusters from a router, how would the router know to drop anycast
advertisements?

(Sure, you could run zebra on the cluster.  But what about if the name
server SEGVs?  There's a lot of possible scenarios)

-- 
-- Todd Vierling [EMAIL PROTECTED] [EMAIL PROTECTED]


Re: .ORG problems this evening

2003-09-18 Thread Todd Vierling

On Thu, 18 Sep 2003, E.B. Dreger wrote:

: TV Anycasting only works as a redundancy scheme when you have a
: TV mesh of *partially* overlapping BGP advertisements, so that a
: TV client has a guarantee that at least one address in the mix
: TV is located elsewhere from the rest.
:
: Don't be silly.  This is like claiming that multihoming only
: works if you spread services over different netblocks.

We're talking about application (DNS) redundancy here, not transport-level
(6to4 anycast RFC comes to mind) redundancy.  With this in mind:

: Ergo, that's why one withdraws the routes when a pod dies.
: Routes need to reflect what's up.

BGP doesn't know when a DNS server dies.  Therein lies the fundamental
problem of using anycast as an application redundancy scheme.

-- 
-- Todd Vierling [EMAIL PROTECTED] [EMAIL PROTECTED]


Re: .ORG problems this evening

2003-09-18 Thread E.B. Dreger

TV Date: Thu, 18 Sep 2003 13:01:18 -0400 (EDT)
TV From: Todd Vierling


TV BGP doesn't know when a DNS server dies.  Therein lies the
TV fundamental problem of using anycast as an application
TV redundancy scheme.

But it can and should.  Again, seeing if the process is running
is easy; verifying correct functionality requires more work, but
definitely is doable.


Eddy



Re: .ORG problems this evening

2003-09-18 Thread E.B. Dreger

TV Date: Thu, 18 Sep 2003 12:52:29 -0400 (EDT)
TV From: Todd Vierling


TV I couldn't know for sure from some sites, but traceroutes
TV sure got there.  That would imply that (at their end) the
TV advertisements were still up.

Which would be an implementation flaw, not something inherently
wrong with anycast.


TV (Sure, you could run zebra on the cluster.  But what about if
TV the name server SEGVs?  There's a lot of possible
TV scenarios)

That's why the routing daemon must be aware if the service is up
or not.  It requires custom or modified routing software.

Having zebra stat(2) a file that the DNS daemon periodically
touches is a quick way to verify that the DNS server software is
still running.  Easy enough.  Gross, but effective, and easy
enough.
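
For illustration, the stat(2) check might look something like the C
sketch below.  The function name heartbeat_is_fresh, the path argument,
and the max_age threshold are assumptions made for this example, not
anything zebra provides; wiring the result into a routing daemon is left
out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical helper: returns true if `path` exists and was last
 * modified no more than `max_age` seconds before `now`.  A missing
 * file or any other stat(2) error is treated as a stale heartbeat.
 */
bool heartbeat_is_fresh(const char *path, time_t now, time_t max_age)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return false;
    return now - st.st_mtime <= max_age;
}
```

This only shows that something is still touching the file, which is the
"process is running" level of assurance and no more.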

A proper implementation has the routing daemon monitor the
service in question -- in this case DNS.  If a series of test
queries provide the correct response, all is well; if not, it's
time to yank the route.
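
As a rough sketch of what "the correct response" could mean at the wire
level, the function below inspects the fixed 12-byte DNS header of a
reply (layout per RFC 1035).  The function name, the choice of tests,
and the requirement of at least one answer record are assumptions for
illustration; the last one presumes the probe asks for a record the
server answers authoritatively, such as the zone's SOA, rather than a
name it would only refer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sanity test for a DNS reply held in `buf`.
 * Passes only if the reply carries the expected query ID, has the
 * QR (response) bit set, reports RCODE 0 (NOERROR), and contains at
 * least one answer record.
 */
bool dns_reply_looks_sane(const uint8_t *buf, size_t len, uint16_t expected_id)
{
    uint16_t id, ancount;
    bool is_response;
    unsigned rcode;

    assert(buf != NULL);
    if (len < 12)                        /* shorter than the fixed header */
        return false;

    id          = (uint16_t)((buf[0] << 8) | buf[1]);
    is_response = (buf[2] & 0x80) != 0;  /* QR bit */
    rcode       = buf[3] & 0x0F;         /* low four bits of byte 3 */
    ancount     = (uint16_t)((buf[6] << 8) | buf[7]);

    return id == expected_id && is_response && rcode == 0 && ancount > 0;
}
```

A real monitor would also compare the answer data against known-good
values; the header test alone only catches the grosser failures.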

Again, perhaps there are implementation flaws... I don't know
anything about UltraDNS's internal network.  But these can be
fixed, and do not make anycast inherently unreliable.  If one
understands, thinks about, and approaches the problem, it can be
solved.


Eddy



Re: .ORG problems this evening

2003-09-18 Thread bmanning

 TV BGP doesn't know when a DNS server dies.  Therein lies the
 TV fundamental problem of using anycast as an application
 TV redundancy scheme.
 
 But it can and should.  Again, seeing if the process is running
 is easy; verifying correct functionality requires more work, but
 definitely is doable.
 
 
 Eddy
 --

Ick.  you really believe that BGP can or should be augmented to 
understand application liveness?   BGP reaching past the router,
running a ps -augx and then performing applications specific tricks?

I guess that when all you have/understand is a hammer, everything
becomes a nail.

Wait...  It's a joke!  you just forgot the :)

--bill


Re: .ORG problems this evening

2003-09-18 Thread Stephen J. Wilcox


On Thu, 18 Sep 2003, Todd Vierling wrote:

 
 On Thu, 18 Sep 2003, E.B. Dreger wrote:
 
 : TV Date: Thu, 18 Sep 2003 10:05:15 -0400 (EDT)
 : TV From: Todd Vierling
 :
 : TV DNS site A goes down, but its BGP advertisements are still in
 : TV effect.
 :
 : Or are they?
 
 I couldn't know for sure from some sites, but traceroutes sure got there.
 That would imply that (at their end) the advertisements were still up.
 
 BGP has no way to know that an internal network problem occurred.  If
 someone mistakenly tripped over a network cable that disconnected DNS
 clusters from a router, how would the router know to drop anycast
 advertisements?
 
 (Sure, you could run zebra on the cluster.  But what about if the name
 server SEGVs?  There's a lot of possible scenarios)

Almost there... just make sure your zebra IGPs are redistributed into your BGP,
so that a failure such as that knocks out the BGP announcement too

Steve



Re: .ORG problems this evening

2003-09-18 Thread Keptin Komrade Dr. BobWrench III esq.

Todd Vierling wrote:

 BGP doesn't know when a DNS server dies.  Therein lies the fundamental
 problem of using anycast as an application redundancy scheme.

You ever think that maybe, just maybe, Ultra wrote some code to do this?

Yes, it might have conceivably failed in a way that seems to have left 
you and one or two others in the veritable dark, but I don't think, at 
this point, using NANOG to debug the problem, no matter where it was, is 
going to be very productive.

But, of course, I don't know anything about using DNS and anycast. ;-)

Bob








Re: .ORG problems this evening

2003-09-18 Thread Keptin Komrade Dr. BobWrench III esq.

E.B. Dreger wrote:

 TV Date: Thu, 18 Sep 2003 13:01:18 -0400 (EDT)
 TV From: Todd Vierling

 TV BGP doesn't know when a DNS server dies.  Therein lies the
 TV fundamental problem of using anycast as an application
 TV redundancy scheme.

 But it can and should.  Again, seeing if the process is running
 is easy; verifying correct functionality requires more work, but
 definitely is doable.

And, I might add, in the case of a highly complex anycast application, 
you will need to check not only for correctness, but for timeliness. 
And, again, in the case of a highly complex app such as an anycast DNS, 
you need to check several behind the scenes apps, such as maybe a db, 
the responsiveness of your high avail partner server, the dns daemon, 
connectivity through two or more network paths, connectivity to master 
update servers, BGP on whatever boxes are providing BGP, etc, the list 
goes on.

But again, that's just my opinion, I could be wrong. ;-)





Re: .ORG problems this evening

2003-09-18 Thread just me

On Thu, 18 Sep 2003, Todd Vierling wrote:

  BGP has no way to know that an internal network problem occurred.  If
  someone mistakenly tripped over a network cable that disconnected DNS
  clusters from a router, how would the router know to drop anycast
  advertisements?

  (Sure, you could run zebra on the cluster.  But what about if the name
  server SEGVs?  There's a lot of possible scenarios)


I can assure you, this is a solved problem.


[EMAIL PROTECTED]darwin
   Flowers on the razor wire/I know you're here/We are few/And far
   between/I was thinking about her skin/Love is a many splintered
   thing/Don't be afraid now/Just walk on in. #include disclaim.h



Re: .ORG problems this evening

2003-09-18 Thread bmanning

  BGP has no way to know that an internal network problem occurred.  If
  someone mistakenly tripped over a network cable that disconnected DNS
  clusters from a router, how would the router know to drop anycast
  advertisements?
  
  (Sure, you could run zebra on the cluster.  But what about if the name
  server SEGVs?  There's a lot of possible scenarios)
 
 Almost there... just make sure your zebra IGPs are redistributed into your BGP,
 so that a failure such as that knocks out the BGP announcement too
 
 Steve
 
Sorry no zebra.  Perhaps I should run my TLDs
DNS service on my Juniper Routers.  some expect/cron
work should provide the needed glue...

Now if I could just get cisco to add authoritative 
DNS service to IOS, right up there with the HTTP, firewall,
content caching, and load-balancing cruft they have 
added to their basic routing code...  I could use
cisco too! (may still need some glue tho)

In case it was not clear, I think that multi-tasking 
hardware might be the wrong choice.  I want my routers
to route and not do apps work.  For apps, I want them
to be single-app specific.  DNS service on its own hardware,
NTP on its platform, HTTP outsourced to (vendor), etc.

This has impact on the design of anycast solutions.
Ultra has one model, ISC has another, and PCH uses
a third. The more generic content crowd has its favorites.
Then there are the load-balancing vendors who
cater to these folks.  One size does not fit all.

--bill


anycast (Re: .ORG problems this evening)

2003-09-18 Thread E.B. Dreger

 Date: Thu, 18 Sep 2003 13:47:01 -0400
 From: Keptin Komrade Dr. BobWrench III esq.


 And, I might add, in the case of a highly complex anycast
 application, you will need to check not only for correctness,
 but for timeliness.

In a realtime system, something that is late is considered
incorrect.  A DNS response that arrives after three seconds is
unsat, and (from a RT perspective) incorrect.  I should have been
more clear in my wording.
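
To make the "late is incorrect" rule concrete, here is a small C sketch
that wraps any health check and fails it when it overruns a deadline.
The names check_fn, check_within_deadline, fast_check and slow_check are
made up for this example, and the 50 ms sleep merely stands in for a
slow server.  Note that this only classifies a slow check after the
fact; it does not interrupt one that hangs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

typedef bool (*check_fn)(void *arg);

/* Elapsed time between two monotonic timestamps, truncated to
 * whole milliseconds. */
static long elapsed_ms(const struct timespec *start,
                       const struct timespec *end)
{
    return (long)(end->tv_sec - start->tv_sec) * 1000L
         + (end->tv_nsec - start->tv_nsec) / 1000000L;
}

/* Runs `check`; the result counts as success only if the check
 * passed AND finished within `deadline_ms` milliseconds. */
bool check_within_deadline(check_fn check, void *arg, long deadline_ms)
{
    struct timespec start, end;
    bool ok;

    assert(check != NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    ok = check(arg);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return ok && elapsed_ms(&start, &end) <= deadline_ms;
}

/* Stand-in checks for demonstration only. */
bool fast_check(void *arg)
{
    (void)arg;
    return true;
}

bool slow_check(void *arg)
{
    struct timespec pause = { 0, 50L * 1000L * 1000L };  /* 50 ms */

    (void)arg;
    nanosleep(&pause, NULL);
    return true;
}
```

With a three second budget one would call something like
check_within_deadline(probe, ctx, 3000) from the monitoring loop.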


 And, again, in the case of a highly complex app such as an
 anycast DNS, you need to check several behind the scenes apps,
 such as maybe a db, the responsiveness of your high avail
 partner server, the dns daemon, connectivity through two or
 more network paths, connectivity to master update servers, BGP
 on whatever boxes are providing BGP, etc, the list goes on.

Yes on all counts, except perhaps connectivity... BGP handles
that.  If you mean killing the link in case of saturation, I'd
argue that's a bad idea -- that just means the large traffic
quantity will go elsewhere.


 But again, that's just my opinion, I could be wrong. ;-)

That's why one uses a daemon with main loop including something
like:

    success = 0 ;
    for ( i = checklist ; i->callback != NULL ; i++ )
        success &= i->callback(foo) ;
    if ( success )
        send_keepalive(via_some_ipc_mechanism) ;

The BGP mechanism listens for keepalives via the IPC mechanism.
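
The listening side could be sketched along the following lines, assuming
a simple hold timer: the route stays announced only while keepalives
keep arriving.  The struct and function names and the hold_time
parameter are hypothetical, and the actual IPC transport and the BGP
announce/withdraw calls are deliberately omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical receiver-side state for one monitored service. */
struct keepalive_monitor {
    time_t last_seen;   /* arrival time of the most recent keepalive */
    time_t hold_time;   /* seconds of silence tolerated */
    bool   seen_any;    /* false until the first keepalive arrives */
};

void monitor_init(struct keepalive_monitor *m, time_t hold_time)
{
    assert(m != NULL && hold_time > 0);
    m->last_seen = 0;
    m->hold_time = hold_time;
    m->seen_any  = false;
}

/* Call whenever a keepalive arrives over the IPC channel. */
void monitor_keepalive(struct keepalive_monitor *m, time_t now)
{
    assert(m != NULL);
    m->last_seen = now;
    m->seen_any  = true;
}

/* True while the route should stay announced: at least one keepalive
 * has been seen and the latest is no older than hold_time. */
bool monitor_should_announce(const struct keepalive_monitor *m, time_t now)
{
    assert(m != NULL);
    return m->seen_any && now - m->last_seen <= m->hold_time;
}
```

Starting in the "not announced" state means a freshly booted box stays
out of the routing table until its checks have passed at least once.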


Eddy



Re: .ORG problems this evening

2003-09-18 Thread E.B. Dreger

 Date: Thu, 18 Sep 2003 10:29:06 -0700 (PDT)
 From: bmanning


 Ick.  you really believe that BGP can or should be augmented to
 understand application liveness?   BGP reaching past the

And why not?  BGP deals in reachability information.  Perhaps it
conventionally represents interface and link state, but there is
nothing making that the One True Way.

From the BGP scanner's perspective, it's just checking another
keepalive.  What generates the keepalive for the route matters
not.  Do you mean that a dead server is just as up as a live
server, yet a dead link is not as up as a live link?  That's
preposterous.


 router, running a ps -augx and then performing applications
 specific tricks?

No need to use gross shell scripts.  Far better means of IPC
exist.  Please read my previous messages.


 I guess that when all you have/understand is a hammer,
 everything becomes a nail.

If you have any specific technical complaints (not how it's
usually done doesn't count), I'm all ears.  I'm also open to a
better way; my MUA seems to have truncated the part where you
suggested one. :-)


 Wait...  It's a joke!  you just forgot the :)

No.  It works well, as long as flaps are confined.
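
One possible way to keep a marginal server from flapping the route,
offered purely as an illustration, is to require several consecutive
failures before withdrawing and several consecutive successes before
re-announcing.  The flap_guard names and both thresholds below are
assumptions, not a description of any particular deployment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hysteresis state for one anycast route. */
struct flap_guard {
    unsigned down_after;  /* consecutive failures needed to withdraw */
    unsigned up_after;    /* consecutive successes needed to announce */
    unsigned streak;      /* current run of results that contradict
                             the `announced` state */
    bool     announced;   /* current state of the route */
};

/* Starts withdrawn, so the route appears only after `up_after`
 * consecutive passing checks. */
void flap_guard_init(struct flap_guard *g, unsigned down_after,
                     unsigned up_after)
{
    assert(g != NULL && down_after > 0 && up_after > 0);
    g->down_after = down_after;
    g->up_after   = up_after;
    g->streak     = 0;
    g->announced  = false;
}

/* Feeds one health-check result; returns whether the route should
 * be announced afterwards. */
bool flap_guard_update(struct flap_guard *g, bool check_ok)
{
    assert(g != NULL);
    if (check_ok == g->announced) {
        g->streak = 0;                /* result agrees with state */
    } else if (++g->streak >=
               (g->announced ? g->down_after : g->up_after)) {
        g->announced = !g->announced; /* enough evidence: flip state */
        g->streak = 0;
    }
    return g->announced;
}
```

Making up_after larger than down_after biases the guard toward pulling
a bad node quickly and bringing it back slowly.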


Eddy



Re: .ORG problems this evening

2003-09-18 Thread Todd Vierling

On Thu, 18 Sep 2003, Keptin Komrade Dr. BobWrench III esq. wrote:

: And, I might add, in the case of a highly complex anycast application,
: you will need to check not only for correctness, but for timeliness.

All this still assumes that DNS should be trusting a single anycast location
as the only point of access (a situation which is the case for UltraDNS if
both records' routes go to the same place).

There's a reason DNS does not trust exactly one server if multiple ones are
provided:  too many things can and do go wrong.  What is going on right now
with .ORG is that DNS is being forced to believe that BGP knows what is best
for it, and it's already demonstrated that BGP did not always know best.

-- 
-- Todd Vierling [EMAIL PROTECTED] [EMAIL PROTECTED]


Re: anycast (Re: .ORG problems this evening)

2003-09-18 Thread E.B. Dreger

EBD Date: Thu, 18 Sep 2003 18:01:07 +0000 (GMT)
EBD From: E.B. Dreger


EBD That's why one uses a daemon with main loop including
EBD something like:
EBD
EBD     success = 0 ;
EBD     for ( i = checklist ; i->callback != NULL ; i++ )
EBD         success &= i->callback(foo) ;
EBD     if ( success )
EBD         send_keepalive(via_some_ipc_mechanism) ;

Eek!

s,success = 0,success = 1,
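
With that substitution applied, the loop can be fleshed out into
compilable C roughly as follows.  This assumes the results are
accumulated with &=, so that a single failed check clears the flag;
the struct layout, the run_checks name, and the two stand-in callbacks
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*check_cb)(void *ctx);

/* One entry per health check; a NULL callback ends the list. */
struct check {
    check_cb callback;
};

/* Runs every check in the list (no early exit) and returns true
 * only if all of them passed.  An empty list passes. */
bool run_checks(const struct check *checklist, void *ctx)
{
    const struct check *i;
    bool success = true;   /* must start true: &= can only clear it */

    assert(checklist != NULL);
    for (i = checklist; i->callback != NULL; i++)
        success &= i->callback(ctx);
    return success;
}

/* Stand-in checks for demonstration only. */
bool check_pass(void *ctx)
{
    (void)ctx;
    return true;
}

bool check_fail(void *ctx)
{
    (void)ctx;
    return false;
}
```

The caller would send the keepalive only when run_checks() returns
true; starting from 0 instead would mean no keepalive is ever sent.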


Eddy



Re: .ORG problems this evening

2003-09-18 Thread Todd Vierling

On Thu, 18 Sep 2003, John Fraizer wrote:

: As has been stated by others, UltraDNS, like the roots and other TLD hosts
: is under nearly constant attack.  Perhaps your local nodes were affected
: by an attack. IE; the pipe was full but the service was still alive so the
: anycast prefix wasn't retracted.  Bummer.  Sucks to be you.

Sucks to be anyone trying to use the service whose routers pick those nodes
as the only ones available.  That's the fault of the implementor, not the
client.

The major issue here is that no *gTLD*, particularly one of the Big Three,
should be subject to a SPOF -- even if it's only a regionally visible SPOF
due to anycast selection.  It should *always* be possible to attempt queries
to more than one physical location's servers for a gTLD.  Yet last night, I
could not query .ORG from several different locations in the continental US,
even though there were perfectly functional servers available (in the same
country, no less).

BGP errors happen (everyone here should be able to attest to that readily),
and they did.  What's to stop some other boneheaded DoS or oversight from
causing this again?  And again?

This particular outage was in the late evening in what appeared to be the
affected area from my probing, which is why people like you don't appear to
care; it didn't affect you.  What about when it happens in the middle of
the day in your neck of the woods?

: Doesn't really matter to me though.  Bitch and moan all you like.
: Demonstrate your lack of experience and understanding.

Uh-huh.  Quite a few people here know better; they also know I am surrounded
by <cloak/> on this list and others.  If my public resume were up to date
and filled in more detail, you'd know otherwise.  Don't try to speak for my
experience from your pedestal when you don't have the information to make
that kind of baseless judgment.

On the other hand, if you can't see the fatal flaw in a major Internet
infrastructure service depending on a single point of failure, I can point
you at a few books that could enlighten you.

-- 
-- Todd Vierling [EMAIL PROTECTED] [EMAIL PROTECTED]


Re: .ORG problems this evening

2003-09-18 Thread bmanning

 Bill, I know you know better, so let's try more facts and less
 FUD.  Mmmmkay?  Your above paragraph is a red herring that is
 analogous to saying all multihomed services must be run on the
 router itself.

yes, it does lean that way... but to expose a sigma-six
blip in how some people may think about anycasting techniques.

 Here's the deal:  DNS server runs a BGP/OSPF/whatever speaker.

One model.  ISC is enamored of this model.  I'm not.
http://www.isc.org/tn/isc-tn-2003-1.txt

 You won't find a turnkey RPM to do it, but that doesn't mean it's
 impossible.  In fact, if you slow down and read previous posts,
 you'll note some very big hints re how to build such a working
 system.  If you're limited to installing out-of-the-box packages,
 you _will_ have a huge mess... but that's not my problem.

Nope, it can even be done w/ COTS technologies.
Been there, Done that.  Ate the cheese as fondue.


  This has impact on the design of anycast solutions.
  Ultra has one model, ISC has another, and PCH uses
  a third. The more generic content crowd has its favorites.
  Then there are the load-balancing vendors who
  cater to these folks.  One size does not fit all.
 
 Okay, I'll give you credit for that paragraph.

thanks.  we now return you to your worst-design
show&tell.  (my fav today  optical connectors!)

 Eddy

--bill


Re: .ORG problems this evening

2003-09-18 Thread E.B. Dreger

TV Date: Thu, 18 Sep 2003 14:22:19 -0400 (EDT)
TV From: Todd Vierling


TV Sucks to be anyone trying to use the service whose routers
TV pick those nodes as the only ones available.  That's the
TV fault of the implementor, not the client.

Yes.


TV The major issue here is that no *gTLD*, particularly one of
TV the Big Three, should be subject to a SPOF -- even if it's
TV only a regionally visible SPOF

Yes.


TV due to anycast selection.

Which would be due to a broken implementation.

Broken unicast is bad.  Not all unicast is bad.  Broken anycast
is bad.  Not all anycast is bad.


TV It should *always* be possible to attempt queries to more
TV than one physical location's servers for a gTLD.

_Or_ guarantee that the physical location selected was indeed up.
Again, it smells an awful lot like plain old multihoming... if
you advertise the route, you'd better be ready to handle the
traffic.  (Did someone say 7007?)


TV BGP errors happen (everyone here should be able to attest to
TV that readily), and they did.  What's to stop some other
TV boneheaded DoS or oversight from causing this again?  And
TV again?

I've had problems with unicast when a link went down, yet the
upstream continued advertising the routes.  BGP stupidity happens
with unicast service, too.

Yes, anycast requires some additional thought and out-of-box
thinking.  But that doesn't make it inherently unstable.


Eddy



Re: anycast (Re: .ORG problems this evening)

2003-09-18 Thread Todd Vierling

On Thu, 18 Sep 2003, E.B. Dreger wrote:

: EBD That's why one uses a daemon with main loop including
: EBD something like:
: EBD
: EBD     success = 0 ;
: EBD     for ( i = checklist ; i->callback != NULL ; i++ )
: EBD         success &= i->callback(foo) ;
: EBD     if ( success )
: EBD         send_keepalive(via_some_ipc_mechanism) ;
:
: Eek!
:
: s,success = 0,success = 1,

Heh.  I'll send you some coffee.

Yes, I hope that UltraDNS implements something like this, if they have not
already.  It's still not a guarantee that things will get withdrawn -- or be
reachable, even if working but not withdrawn -- in case of a problem.  That
still leaves the DNS for a gTLD at risk for a single point of failure.

Maybe I should just chalk this up to history at this point.  I have a
feeling, though, that the head of this thread's archive URL will show up as
a citation some time from now when something else goes wrong with the zone.

sigh

-- 
-- Todd Vierling [EMAIL PROTECTED] [EMAIL PROTECTED]


Re: .ORG problems this evening

2003-09-18 Thread E.B. Dreger

 Date: Thu, 18 Sep 2003 11:36:37 -0700 (PDT)
 From: bmanning


  Bill, I know you know better, so let's try more facts and less
  FUD.  Mmmmkay?  Your above paragraph is a red herring that is
  analogous to saying all multihomed services must be run on the
  router itself.

   yes, it does lean that way... but to expose a sigma-six
   blip in how some people may think about anycasting techniques.

Regardless of the technology, one can _always_ create a stupid
way of doing things.  With any luck, however, a _good_ way
exists, too.


  Here's the deal:  DNS server runs a BGP/OSPF/whatever speaker.

   One model.  ISC is enamored of this model.  I'm not.
   http://www.isc.org/tn/isc-tn-2003-1.txt

Yes, one model.  Skimming the ISC paper, I also have mixed
feelings about some sections.  The basic principle, however,
boils down to getting traffic to the right place based on
factors such as reachability and correctness.


   Nope, it can even be done w/ COTS technologies.

Noted.  I suppose some implementations may indeed be turnkey...
just that we've never seen the One True Tarball for the way we
like to do it.  My fault for overgeneralizing.


Eddy



Re: .ORG problems this evening

2003-09-18 Thread Todd Vierling

On Thu, 18 Sep 2003, John Fraizer wrote:

: Todd, you don't make the announcement for the anycast address from your
: border..  You do it from within the anycast cluster as a CONDITIONAL
: announcement.  IE; you use a specially written BGP daemon that makes the
: announcement when the service is alive and retracts it when it isn't.

Um, I did in fact previously mention running BGP on the cluster -- which
was referring directly to the DNS service machines -- and you even responded
to that message.  Yes, I do understand.  (Ref:  One of the things I do for a
living is work on a BGP4 peer implementation written from scratch.)

Doing this requires implementing keepalive handling in the service
monitoring side of the world correctly.  Which, obviously, *the entity in
question did not*.  Because of this, I can no longer trust them to get it
right next time without changing their fundamental design.  It's not like
this is all that hard to grasp:  the services for a TLD are much more
critical than a 2LD or 3LD and should be given much more thought into
failover handling than just anycast will do it for us.

The other two of the Big Three gTLDs, and most ccTLDs, allow a client to
attempt queries to geographically diverse DNS servers at any time,
regardless of the BGP table's correctness, in order to allow some additional
level of failover and reliability assessment by the DNS client.  Some of
these servers could run anycast, and I wouldn't even know it without looking
deeper.  What I can trivially see, though, is where geographically diverse
servers are available on said TLDs, I can get a guarantee that at least two
from each zone's NS group go to different places.

Why is .ORG somehow different and special that I/we should trust a third
party to do the whole operation solely via anycast, where said anycast has
the possibility of becoming a single point of failure?

-- 
-- Todd Vierling [EMAIL PROTECTED] [EMAIL PROTECTED]


Re: .ORG problems this evening

2003-09-18 Thread Majdi S. Abbas

On Thu, Sep 18, 2003 at 02:22:19PM -0400, Todd Vierling wrote:
 Sucks to be anyone trying to use the service whose routers pick those nodes
 as the only ones available.  That's the fault of the implementor, not the
 client.

I have a sneaking suspicion that if UltraDNS's tld cluster that is
apparently located in Equinix-Ashburn stopped responding to queries for
two hours last night, a lot more people would have noticed.  A *lot* more
people.

I think it's out of line to speculate on how UltraDNS has configured
these clusters, particularly in terms of how reachability information is
verified and propagated without any knowledge of their configuration.

 The major issue here is that no *gTLD*, particularly one of the Big Three,
 should be subject to a SPOF -- even if it's only a regionally visible SPOF
 due to anycast selection.  It should *always* be possible to attempt queries
 to more than one physical location's servers for a gTLD.  Yet last night, I
 could not query .ORG from several different locations in the continental US,
 even though there were perfectly functional servers available (in the same
 country, no less).

First it was two locations, one of which you can't tell us about
(Deep inside OSPF Area 51?) -- now it's several?  I've tried myself from 
many different hosts today, and they all route to different clusters.  I'm
having trouble finding more than one, geographically diverse host that 
routes to the same cluster.

 BGP errors happen (everyone here should be able to attest to that readily),
 and they did.  What's to stop some other boneheaded DoS or oversight from
 causing this again?  And again?

Are you absolutely, positively sure this cluster was responding to 0
queries, but still propagating those two /24's?

 This particular outage was in the late evening in what appeared to be the
 affected area from my probing, which is why people like you don't appear to
 care; it didn't affect you.  What about when it happens in the middle of
 the day in your neck of the woods?

The reason for this is simple -- given the query volume a tld like
.org receives, and given just how close this cluster is to so many 
millions of users in the eastern US, the odds of you being the *only* 
person, even amongst the few thousand on this list, to notice a problem...
are incredibly slim.

Since you won't tell us where these several hosts you tried to
query from are addressed, and you won't tell us exactly which queries
you tried, and how...it is incredibly hard to look into.

This is the equivalent of calling every fire department in the
nation and telling them that there is a fire, but refusing to tell them
where you are, or what you've witnessed.

 Uh-huh.  Quite a few people here know better; they also know I am surrounded
 by <cloak/> on this list and others.  If my public resume were up to date
 and filled in more detail, you'd know otherwise.  Don't try to speak for my
 experience from your pedestal when you don't have the information to make
 that kind of baseless judgment.

 On the other hand, if you can't see the fatal flaw in a major Internet
 infrastructure service depending on a single point of failure, I can point
 you at a few books that could enlighten you.

It isn't a single point of failure, but even if it were, I can
assure you that the collective experience of this list would fill quite
a few more volumes than you are capable of referring us to.

You ask that we make no assumptions as to your experience --
grant us the same courtesy.

--msa


Re: .ORG problems this evening

2003-09-18 Thread Todd Vierling

On Thu, 18 Sep 2003, Majdi S. Abbas wrote:

:  Sucks to be anyone trying to use the service whose routers pick those nodes
:  as the only ones available.  That's the fault of the implementor, not the
:  client.

:   I think it's out of line to speculate on how UltraDNS has configured
: these clusters,

I don't care what the underlying implementation is.  I care about the
effect:  that for at least one hour, possibly up to two last night, one of
the physical locations went dead but was still considered available via
BGP, while being considered the best available path to both nets.

:   First it was two locations, one of which you can't tell us about
: (Deep inside OSPF Area 51?)

I can't provide all the exact source machines for reasons I can discuss
offlist, but I'm happy to do so to a representative of UltraDNS.  My home
machine, though, is 66.56.93.94.

: now it's several?

Three to be exact that I verified last night to be unable to query DNS from
either IP address: one at my home (Atlanta GA), one at my employer (Atlanta
GA), and one in Chicago IL.  However, here's three straw examples of both
IPs going to the same place from spot checks right now (funny, my home
machine actually gets two different ones at this moment):

= Southern CA =
traceroute to tld1.ultradns.net (204.74.112.1): 1-30 hops, 38 byte packets
...
 .  p4-1-0-0.r00.lsanca01.us.bb.verio.net (129.250.16.80)  16.9 ms (ttl=251!)
 .  p16-1-1-0.r21.lsanca01.us.bb.verio.net (129.250.2.10)  19.5 ms (ttl=250!)
 .  ge-1-0.a01.lsanca02.us.ra.verio.net (129.250.29.131)  3.44 ms
 .  66.238.50.26.ptr.us.xo.net (66.238.50.26)  13.2 ms (ttl=248!)
 .  dellfwisi.ultradns.net (204.74.98.2)  13.8 ms (ttl=57!) !H

traceroute to tld2.ultradns.net (204.74.113.1): 1-30 hops, 38 byte packets
...
 .  p5-1-0-0.RAR1.LA-CA.us.xo.net (65.106.5.13)  2.64 ms (ttl=250!)
 .  p0-0-0.MAR1.LA-CA.us.xo.net (65.106.5.6)  2.73 ms (ttl=249!)
 .  p1-0.CHR1.LA-CA.us.xo.net (207.88.81.166)  2.78 ms
 .  66.238.50.26.ptr.us.xo.net (66.238.50.26)  35.0 ms
 .  dellfwisi.ultradns.net (204.74.98.2)  29.7 ms (ttl=57!) !H

= Dallas TX =
traceroute to tld1.ultradns.net (204.74.112.1): 1-30 hops, 38 byte packets
...
 .  p16-0-0-0.r01.atlnga03.us.bb.verio.net (129.250.4.195)  25.3 ms (ttl=250!)
 .  p16-2-0-0.r00.atlnga03.us.bb.verio.net (129.250.5.16)  25.3 ms (ttl=249!)
 .  p16-1-0-0.r01.mclnva02.us.bb.verio.net (129.250.2.48)  40.8 ms (ttl=247!)
 .  ge-1-0-0.a00.mclnva02.us.ra.verio.net (129.250.31.170)  40.8 ms (ttl=246!)
 .  168.143.247.38 (168.143.247.38)  44.1 ms (ttl=246!)
 .  64.124.112.141.ultradns.com (64.124.112.141)  45.0 ms (ttl=244!)
 .  dellfwpxvn.ultradns.net (204.74.104.2)  43.7 ms (ttl=53!) !H

traceroute to tld2.ultradns.net (204.74.113.1): 1-30 hops, 38 byte packets
...
 .  sl-bb26-fw-5-1.sprintlink.net (144.232.20.147)  7.54 ms
 .  sl-bb25-fw-15-0.sprintlink.net (144.232.11.89)  32.0 ms
 .  sl-bb23-atl-10-0.sprintlink.net (144.232.20.60)  36.4 ms
 .  sl-bb26-rly-14-1.sprintlink.net (144.232.20.65)  33.3 ms
 .  sl-st21-ash-14-2.sprintlink.net (144.232.20.3)  34.8 ms
 .  sl-xocomm-5-0.sprintlink.net (144.223.246.50)  34.2 ms
 .  p5-0-0.RAR1.Washington-DC.us.xo.net (65.106.3.133)  35.3 ms (ttl=245!)
 .  p6-1-0.MAR1.Washington-DC.us.xo.net (65.106.3.182)  35.7 ms (ttl=244!)
 .  p0-0.CHR1.Washington-DC.us.xo.net (207.88.87.10)  35.7 ms
 .  64.124.112.141.ultradns.com (64.124.112.141)  39.7 ms (ttl=244!)
 .  dellfwpxvn.ultradns.net (204.74.104.2)  40.0 ms (ttl=53!) !H

= Chicago IL =
traceroute to tld1.ultradns.net (204.74.112.1): 1-30 hops, 38 byte packets
...
 .  gige3-2.core2.Chicago1.Level3.net (209.244.8.185)  0.796 ms
 .  so-4-1-0.bbr1.Chicago1.level3.net (209.247.10.165)  0.905 ms (ttl=250!)
 .  so-6-0-0.edge1.Chicago1.Level3.net (209.244.8.10)  1.01 ms (ttl=249!)
 .  verio-level3-oc12.Chicago1.Level3.net (209.0.227.66)  0.860 ms (ttl=251!)
 .  ge-1-2.a00.chcgil07.us.ra.verio.net (129.250.25.136)  0.967 ms (ttl=253!)
 .  fa-2-1.a00.chcgil07.us.ce.verio.net (128.242.186.134)  1.04 ms (ttl=251!)
 .  dellfweqch.ultradns.net (204.74.102.2)  0.881 ms (ttl=60!) !H

traceroute to tld2.ultradns.net (204.74.113.1): 1-30 hops, 38 byte packets
...
 .  0.so-1-0-0.XL2.CHI13.ALTER.NET (152.63.69.182)  1.58 ms (ttl=251!)
 .  POS7-0.BR1.CHI13.ALTER.NET (152.63.73.22)  1.29 ms
 .  a11-0d114.IR1.Chicago2-IL.us.xo.net (206.111.2.73)  1.11 ms (ttl=251!)
 .  p5-0-0.RAR1.Chicago-IL.us.xo.net (65.106.6.133)  1.40 ms
 .  p4-0-0.MAR1.Chicago-IL.us.xo.net (65.106.6.142)  2.03 ms
 .  p0-0.CHR1.Chicago-IL.us.xo.net (207.88.84.10)  1.80 ms (ttl=248!)
 .  *
 .  dellfweqch.ultradns.net (204.74.102.2)  1.48 ms (ttl=60!) !H

===

:   Are you absolutely, positively sure this cluster was responding to 0
: queries,

Yes.  My mail server was more or less dead (it's a .org) for an hour, and I
was trying frantically to get DNS to resolve with all kinds of dig
requests directly to the IPs and traceroute tests until I gave up after an
hour.

: but still propagating those 

Re: .ORG problems this evening

2003-09-17 Thread Jared Mauch

On Thu, Sep 18, 2003 at 12:50:28AM -0400, Todd Vierling wrote:
 
 tld[12].ultradns.net, the NS for .ORG, was completely unreachable for about
 an hour or two this evening, timing out on all DNS queries.  Anyone else see
 similar?  (The hosts are unpingable and untracerouteable, so I had to use
 DNS queries to determine when they were back up.)
 
 It makes me wonder how UltraDNS got a contract to manage the domain on all
 of two nameservers hosted on the same subnet, given that they were supposed
 to have deployed geographically diverse (or something like that) servers.
 But then, we know ICANN smokes the crack liberally at times

dare i say duh,

but ...

ultradns uses the power of anycast to have these ips that appear
to be on close subnets in geographically diverse locations.

go to europe, traceroute to them, it goes to a place
in europe.

go to asia, traceroute to them, it goes to a machine in
asia.

in the us, it goes to one of a few geographical locations ...

could you provide some more technical details, other than
your postulations that they have two machines on
network-wise close subnets and that is the problem?

- jared

 sigh
 
 -- 
 -- Todd Vierling [EMAIL PROTECTED] [EMAIL PROTECTED]

-- 
Jared Mauch  | pgp key available via finger from [EMAIL PROTECTED]
clue++;  | http://puck.nether.net/~jared/  My statements are only mine.


Re: .ORG problems this evening

2003-09-17 Thread E.B. Dreger

TV Date: Thu, 18 Sep 2003 00:50:28 -0400 (EDT)
TV From: Todd Vierling


TV tld[12].ultradns.net, the NS for .ORG, was completely
TV unreachable for about an hour or two this evening, timing out
TV on all DNS queries.  Anyone else see similar?  (The hosts are

I don't recall having troubles this evening.  Perhaps there was a
DoS or something pounding the anycast node you were hitting?
With multiple sinkholes, it's no longer all or nothing.

Anycast is good stuff, IMHO, but not impervious to flooding.


Eddy
--
Brotsman & Dreger, Inc. - EverQuick Internet Division
Bandwidth, consulting, e-commerce, hosting, and network building
Phone: +1 785 865 5885 Lawrence and [inter]national
Phone: +1 316 794 8922 Wichita
_
  DO NOT send mail to the following addresses :
  [EMAIL PROTECTED] -or- [EMAIL PROTECTED] -or- [EMAIL PROTECTED]
Sending mail to spambait addresses is a great way to get blocked.



Re: .ORG problems this evening

2003-09-17 Thread Christopher L. Morrow



On Thu, 18 Sep 2003, Todd Vierling wrote:


 It makes me wonder how UltraDNS got a contract to manage the domain on all
 of two nameservers hosted on the same subnet, given that they were supposed
 to have deployed geographically diverse (or something like that) servers.
 But then, we know ICANN smokes the crack liberally at times

Just because the hosts are on the same subnet and are apparently behind
the same end device for you doesn't make them non-geographically diverse
if they are really anycast pods, does it? It really just means one anycast
pod was down for a time :(

It is one of the things that anycast makes difficult though :(
Troubleshooting anycast from the outside is a bear.


Re: .ORG problems this evening

2003-09-17 Thread Christopher L. Morrow


On Thu, 18 Sep 2003, Christopher L. Morrow wrote:

 On Thu, 18 Sep 2003, Todd Vierling wrote:

 
  It makes me wonder how UltraDNS got a contract to manage the domain on all
  of two nameservers hosted on the same subnet, given that they were supposed
  to have deployed geographically diverse (or something like that) servers.
  But then, we know ICANN smokes the crack liberally at times

 Just because the hosts are on the same subnet and are apparently behind
 the same end device for you doesn't make them non-geographically diverse
 if they are really anycast pods, does it? It really just means one anycast
 pod was down for a time :(

 It is one of the things that anycast makes difficult though :(
 Troubleshooting anycast from the outside is a bear.


Oh, and 'same subnet' doesn't mean 'same ethernet': all auth dns servers in
198.6.1.0/24 aren't on one ethernet, though it'd sure make MY life easier
if they were :)


Re: .ORG problems this evening

2003-09-17 Thread E.B. Dreger

CLM Date: Thu, 18 Sep 2003 05:28:05 +0000 (GMT)
CLM From: Christopher L. Morrow


CLM Just because the hosts are on the same subnet and are
CLM apparently behind the same end device for you doesn't make
CLM them non-geographically diverse if they are really anycast
CLM pods, does it? It really just means one anycast pod was down
CLM for a time :(

Ideally, though, an anycast node should yank the route if the
service in question dies.  I say ideally because we still
haven't had DNS properly make friends with BGP... and such flaps
really shouldn't be seen, which means having a contiguous
internal network and [properly] decoupling IGP from EGP...

...and suddenly I'm making many assumptions. ;-)
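
As a rough illustration of the "yank the route when the service dies" idea, here is a minimal Python sketch. Everything in it is hypothetical: the AnycastGuard name, the thresholds, and the "announce"/"withdraw"/"hold" strings are made up for the example, and actually talking to a routing daemon is left out entirely. The consecutive-result thresholds are one assumed way of keeping a single failed probe from causing a flap.

```python
from typing import Callable, Sequence


class AnycastGuard:
    """Decide when a node should announce or withdraw its anycast route.

    Hypothetical sketch: the caller is expected to act on the returned
    string (for example by telling its routing daemon what to do).
    """

    def __init__(self, checks: Sequence[Callable[[], bool]],
                 fail_threshold: int = 3, ok_threshold: int = 3) -> None:
        self.checks = checks
        self.fail_threshold = fail_threshold
        self.ok_threshold = ok_threshold
        self.announced = False  # start withdrawn until proven healthy
        self._fails = 0
        self._oks = 0

    def tick(self) -> str:
        """Run every check once and return 'announce', 'withdraw' or 'hold'."""
        healthy = all(check() for check in self.checks)
        if healthy:
            self._oks += 1
            self._fails = 0
        else:
            self._fails += 1
            self._oks = 0

        # Only change state after several consecutive identical results.
        if self.announced and self._fails >= self.fail_threshold:
            self.announced = False
            return "withdraw"
        if not self.announced and self._oks >= self.ok_threshold:
            self.announced = True
            return "announce"
        return "hold"
```

A main loop would call tick() on a timer and act only on "announce" or "withdraw". How fast the rest of the world notices a withdrawal is still up to BGP.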


CLM It is one of the things that anycast makes difficult though
CLM :(  Troubleshooting anycast from the outside is a bear.

It's a lot like multihoming, only with different geography.
Unicast IP addresses are analogous to world-facing router
interfaces.

Tip for anyone considering playing with anycast, particularly on
the same ethernet segment:  Bind the anycast IP addresses to your
loopback interface.


Eddy
--
Brotsman & Dreger, Inc. - EverQuick Internet Division
Bandwidth, consulting, e-commerce, hosting, and network building
Phone: +1 785 865 5885 Lawrence and [inter]national
Phone: +1 316 794 8922 Wichita
_
  DO NOT send mail to the following addresses :
  [EMAIL PROTECTED] -or- [EMAIL PROTECTED] -or- [EMAIL PROTECTED]
Sending mail to spambait addresses is a great way to get blocked.



Re: .ORG problems this evening

2003-09-17 Thread Rodney Joffe



Todd Vierling wrote:
 
 tld[12].ultradns.net, the NS for .ORG, was completely unreachable for about
 an hour or two this evening, timing out on all DNS queries.  Anyone else see
 similar?  (The hosts are unpingable and untracerouteable, so I had to use
 DNS queries to determine when they were back up.)

At any given moment, UltraDNS (and I am sure other root and tld servers)
are under attack somewhere from someone. Additionally the monitors that
test each of the anycast nodes reported no outages. Neither did the
useful monitors that Rob Thomas runs at
(http://www.cymru.com/DNS/gtlddns-o.html). Nor did the many helpful
customers who use UltraDNS, and who run constant tests to each
individual anycast node in search of an SLA event that may provide a
service credit. ;-)

Perhaps you had a network problem internally?

 
 It makes me wonder how UltraDNS got a contract to manage the domain on all
 of two nameservers hosted on the same subnet, given that they were supposed
 to have deployed geographically diverse (or something like that) servers.

Fortunately ICANN and the other decision makers were actually network
clueful, and could tell that 204.74.112.1 and 204.74.113.1 are actually
different subnets ;-)
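
For anyone who wants to check the arithmetic, a small Python sketch using the standard ipaddress module follows. The prefix lengths are an assumption for illustration; the thread does not state how the blocks are actually announced, and the answer depends on that.

```python
import ipaddress


def same_subnet(a: str, b: str, prefix_len: int) -> bool:
    """Return True if b falls inside the prefix_len-bit network containing a."""
    network = ipaddress.ip_network(f"{a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(b) in network


# Assuming /24 networks, the two addresses sit in different subnets.
print(same_subnet("204.74.112.1", "204.74.113.1", 24))  # False
# Under a shorter (assumed) /23 they would share one.
print(same_subnet("204.74.112.1", "204.74.113.1", 23))  # True
```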

As an aside, using ping or traceroute at *any* time to see if dns
servers are working is not a great idea. 
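
A more direct test is to send the server an actual DNS query. The following is a minimal sketch in Python using only the standard library; the function names, the fixed query ID, and the choice of an SOA query are assumptions for the example, and it makes no attempt to parse the answer section or retry over TCP.

```python
import socket
import struct


def build_query(name: str, qtype: int = 6, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet (qtype 6 = SOA, class 1 = IN)."""
    # Header: ID, flags (all zero, recursion not requested), 1 question.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".") if label
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def is_good_reply(reply: bytes, query_id: int) -> bool:
    """True if reply is a DNS response to query_id with RCODE 0 (no error)."""
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    is_response = bool(flags & 0x8000)  # QR bit
    rcode = flags & 0x000F
    return reply_id == query_id and is_response and rcode == 0


def dns_answers(server: str, name: str, timeout: float = 2.0) -> bool:
    """Send one UDP query to server and report whether it answered sanely."""
    query_id = 0x1234
    query = build_query(name, query_id=query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable errors
            return False
    return is_good_reply(reply, query_id)
```

Calling dns_answers("204.74.112.1", "org") would then say whether the anycast node reachable from your network answers for the zone, which is the thing you actually care about. It still only tests the one node your routing happens to pick.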
-- 
Rodney Joffe
Speaking on behalf of no-one other than himself.