RE: New Draft Document: De-boganising New Address Blocks

2004-02-25 Thread Michael . Dillon

>> Timothy Brown wrote:
>> I disagree with the view that it is a hack.
>> It's no more a hack than using a DNS feed;

>I concur with this. Besides, from the pragmatic side of the "consumer",
>if it does solve a problem (albeit short or medium term) I don't care
>much if it's a "hack".

Then we all agree. The Cymru bogon feed is a hack. The
completewhois feed is a hack. In fact, any third-party
feed that purports to identify authentic IP allocations
is a hack.

The only fix for this is to get the addressing authorities
to provide an authoritative feed. That probably means
first getting the RIRs to do it and then ICANN to fill
in the gaps.

As to whether the mechanisms used by bogon feeds are hacks
or carefully crafted technology, well, my rule of thumb
is that if it can completely automate the process while still
allowing human intervention to make judgement calls on all
changes to network configuration, then it is carefully crafted
technology. I'm afraid that by this measure, using a BGP feed
as the mechanism is a hack because it involves plugging a third
party directly into your routing architecture. A directory
service like DNS or LDAP is closer to the carefully crafted 
technology because it can be plugged into OSS systems that
allow humans to validate and release any network changes.
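To make the "directory plus human validation" idea concrete, here is a minimal Python sketch. It assumes the published bogon list has already been fetched from a DNS or LDAP directory as a list of prefix strings (the fetch itself is left out), and the function names are hypothetical. It computes what would change and puts each change in front of an operator before anything reaches a router, whereas a BGP feed installs changes directly. The prefixes in the usage lines are illustrative only.

```python
import ipaddress


def diff_bogon_lists(current, published):
    """Compare the prefixes currently filtered with a newly published bogon list.

    Both arguments are iterables of prefix strings such as "10.0.0.0/8".
    Returns (to_add, to_remove) as sorted lists of ip_network objects.
    """
    cur = {ipaddress.ip_network(p) for p in current}
    pub = {ipaddress.ip_network(p) for p in published}
    return sorted(pub - cur), sorted(cur - pub)


def review_changes(to_add, to_remove, approve):
    """Put every proposed change in front of a human before it is released.

    `approve` is a callback (hypothetically, a prompt in an OSS workflow) that
    receives the action and the prefix and returns True to release the change.
    """
    released = []
    for action, nets in (("add", to_add), ("remove", to_remove)):
        for net in nets:
            if approve(action, net):
                released.append((action, net))
    return released


# Illustrative prefixes: the published list has dropped one block and added another.
to_add, to_remove = diff_bogon_lists(
    current=["10.0.0.0/8", "83.0.0.0/8"],
    published=["10.0.0.0/8", "192.168.0.0/16"],
)
# Stand-in for a human reviewer who only approves removals this time.
print(review_changes(to_add, to_remove, lambda action, net: action == "remove"))
```

Nothing in this flow lets third-party data reach the routers without a human decision, which is the property the rule of thumb above asks for.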

--Michael Dillon





Re: Level 3 statement concerning 2/23 events (nothing to see, move along)

2004-02-25 Thread Pete Templin


Are you sure no one died as a result?  My hobby is volunteering as a 
firefighter and EMT.  If Level3's network sits between a dispatch center 
or mobile data terminal and a key resource, it could be a factor 
(hospital status website, hazardous materials action guide, VoIP link 
that didn't reroute because the control plane was happy but the 
forwarding plane was sad, etc.).

And if the problem could happen to another network tomorrow but could be 
prevented or patched, wouldn't inquiring minds want to know?  Your life 
might be more interesting when the fit hits the shan if you have the 
same vulnerability.

Colin Neeson wrote:

> Because, in the grand scale scheme of things, it's really not that
> important.
>
> No one died because of it, the normal, everyday events of the world went
> on, unaffected by a Level 3 outage...
>
> Might be nice to know what happened, but my life will certainly not be
> less interesting by not having that knowledge...



Re: Level 3 statement concerning 2/23 events (nothing to see, move along)

2004-02-25 Thread Stephen J. Wilcox

So c'mon, forget the statement: does anyone know what actually happened?

Steve




Re: Level 3 statement concerning 2/23 events (nothing to see, move along)

2004-02-25 Thread Pete Templin
If an IP-based system lets you see the status of the 23 hospitals in San 
Antonio graphically, perhaps overlaid with near-real-time traffic 
conditions, I'd rather use it as primary and telephone as secondary.

Counting on it?  No.  Gaining usability from it?  You betcha.

Brian Knoblauch wrote:

> If you're counting on IP (a "best attempt" protocol) for critical
> data, you've got a serious design flaw in your system...



Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Jared Mauch

Ok.

I can't sit by here while people speculate about the possible
problems of a network outage.

I think that most everyone here reading NANOG realizes that
the Internet is becoming more and more central to daily life even
for those that are not connected to the internet.

From where I'm sitting, I see a number of potentially dangerous
trends that could result in some quite catastrophic failures of networks.
No, I'm not predicting that the internet will end in 8^H7 days or anything
like that.  I think the Level3 outage as seen from the outside is a clear
case that single providers will continue to have their own network failures
for some time to come.  (I just hope daily it's not my employer's network ;-) )

So, we're sitting here at the crossroads, where VoIP is 
"coming of age".  Vonage, 8x8 and others are blazing a path that
the rest of the providers are now beginning to gun for.  We've already
read in press releases and articles in the past year how providers
in Canada and the US are moving to VoIP transport within their long-distance
networks.

I keep hearing of Frame-Relay and ATM signaling that is going
to happen in large providers' MPLS cores.  That's right, your "safe"
TDM-based services will be transported over someone's IP backbone first.
This means if they don't protect their IP network, the TDM services could
fail.  These types of CES services are not just limited to Frame and ATM.
(Did anyone with frame/atm/vpn services from Level3 experience the
same outage?)

Now the question of Emergency Services is being posed here but also
in parallel by a number of other people at the FCC.  We've seen the E911
recommendation come out regarding VoIP calls.  How long until a simple
power failure results in the inability to place calls?

Now, I'm not trying to pick on Level3 at all.  The trend I
outline here is very real.  The reliance on the Internet for critical
communications is a trend that continues.  Look at how it was used
on 9/11 for communications when cell and land based telephony networks
were crippled.

The internet has become a very critical part of all of our lives
(some more than others) with banks using VPNs to link their ATMs back into
their corporate network as well as the number of people that use it for
just plain "just in time" bill payment and other things.  I can literally
cancel my home phone line, cell phone and communicate solely with my
internet connection, performing all my bill payments without any paperwork.
I can even file my taxes online.

We're at (or already past) the dangerous point of network
convergence.  While I suspect that nobody directly died as a result of
the recent outage, the trend to link together hospitals, doctors
and other agencies via the Internet and a series of VPN clients continues
to grow.  (I say this knowing how important the internet is to
the medical community; reading x-rays and other data scans at home for the
on-call staff is quite common).

While my friends at the local VFD do still have the traditional
pager service with towers, etc... how long until the T1s that are
used for dial-in or speaking to the towers are moved to some sort of
IP based system?  The global economy seems to be going this direction with
varying degrees of caution.

I'm concerned, but not worried.. the network will survive..

- Jared



Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Dave Stewart
At 10:52 AM 2/25/2004, you wrote:

>recommendation come out regarding VoIP calls.  How long until a simple
>power failure results in the inability to place calls?

We're already at that point.  If the power goes out at home, I'd have to 
grab a flashlight and go hunting for a regular ol' POTS-powered phone.  Or 
use the cell phone (as I did when Bubba had a few too many to drink one 
night recently and took out a power transformer).  But I do have a few old 
regular phones.  How many people don't?

Interactive Intelligence, Artisoft and many others are selling businesses
phone systems that run entirely on a "server" that may or may not be
connected to a UPS of sufficient capacity to keep the server running during
an extended outage.  These systems are frequently handling a PRI instead of
POTS lines, so there's no backup when the UPS dies.  Once the "phone server"
goes down, no phone service.

VOIP services have the same problem.  Lights go out, that whiz-bang 
handy-dandy VOIP phone doesn't work, either.

Sure, we're talking about the end user, not the core/backbone.  But the answer
to the question, strictly speaking, is that a simple power outage can 
result in many people being unable to make a simple phone call (or at best, 
relying on their cell phones... assuming the generator fired at their 
nearest cell when the lights went out).




Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Steven M. Bellovin

In message <[EMAIL PROTECTED]>, Jared Mauch writes:
>
>   (I know this is treading on a few "what if" scenarios, but it could
>actually mean a lot if we convert to a mostly IP world as I see the trend).
>

I think your analysis is dead-on.

--Steve Bellovin, http://www.research.att.com/~smb




Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Matthew Crocker

>I'm saying that if a network had a FR/ATM/TDM failure in the past
>it would be limited to just the FR/ATM/TDM network.  (well, aside from
>any IP circuits that are riding that FR/ATM/TDM network).  We're now seeing
>the change from the TDM based network being the underlying network to the
>"IP/MPLS Core" being this underlying network.
>
>What it means is that a failure of the IP portion of the network
>that disrupts the underlying MPLS/GMPLS/whatnot core that is now
>transporting these FR/ATM/TDM services, does pose a risk.  Is the risk
>greater than in the past, relying on the TDM/WDM network?  I think that
>there could be some more spectacular network failures to come.  Overall
>I think people will learn from these to make the resulting networks
>more reliable.  (eg: there has been a lot learned as a result of the
>NE power outage last year).

Internet traffic should run over an IP/MPLS core in a separate session
(VRF, virtual context, whatever..) so the MPLS core never sees the full
BGP routing information of the Internet.  So long as router vendors can
provide proper protection between routing instances, so that one virtual
router can't consume all the memory/CPU, the MPLS core should be pretty
stable.  The core MPLS network and control plane should be completely
separate from regular traffic and much less complex for any given
carrier.  VoIP, Internet, EoM, AToM, FRoM, TDMoM should all run in
separate sessions, all isolated from each other.

A router should act like a Unix machine, treating each MPLS/VRF session
as a separate user, isolating and protecting users from each other, and
providing resource allocation and limits.  I'm not sure of the
effectiveness of current-generation routers, but it should be coming
down the line.  That said, the IP/MPLS core should be more stable than
traditional TDM networks; the Internet itself may not stabilize, but
that shouldn't affect the core.  What happened at L3 was an Internet
outage; in theory, that shouldn't affect the MPLS core.

Think back 10 years, when it was common for a Unix binary to wipe out a
machine by consuming all resources (fork bombs, anyone?).  Unix machines
have come a long way since then.  Routers need to follow the same
progression.  What is the routing equivalent of 'while (1) { fork(); };'?
Currently it is massive BGP flapping that chews up resources.  A good
router should be immune to that, and can be with proper resource
management.
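A minimal Python sketch of the per-instance isolation described above. It assumes route count as a stand-in for memory and uses a hypothetical `RoutingInstance` class; no router vendor's API or behaviour is implied. Each instance gets its own hard budget, so a flood of routes into one instance cannot starve another.

```python
class RoutingInstance:
    """Hypothetical virtual router (VRF) with its own hard route budget.

    Route count stands in for memory here; a real router would also need to
    budget CPU per instance.
    """

    def __init__(self, name, max_routes):
        self.name = name
        self.max_routes = max_routes
        self.routes = set()
        self.rejected = 0

    def install(self, prefix):
        """Install a route if this instance still has budget; never borrow from others."""
        if prefix in self.routes:
            return True
        if len(self.routes) >= self.max_routes:
            self.rejected += 1
            return False
        self.routes.add(prefix)
        return True


internet = RoutingInstance("internet", max_routes=3)
core = RoutingInstance("mpls-core", max_routes=3)

# A flapping or leaking peer floods the Internet instance with 1000 prefixes...
for i in range(1000):
    internet.install(f"10.{i // 256}.{i % 256}.0/24")

# ...yet the core instance still has its whole budget for its own routes.
print(internet.rejected, core.install("192.0.2.1/32"))  # 997 True
```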

-Matt



Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Matthew Crocker

> Is it that sharing fate in the switching fabric (as
> opposed to say, in the transport fabric, or even
> conduit) reduces the resiliency of a given service (in
> this case FR/ATM/TDM), and as such poses the "danger"
> you describe?

Sharing fate in the physical layer (multiple fibers in the same
conduit) or transport layer (multiple services on the same SONET) has
clear and well-defined resource limits.  A GigE running down a piece of
fiber will NEVER jump over to the ATM network fiber and wipe it out.
The same goes for SONET.  An STS-1 is an STS-1 and will never eat up an
OC-48, no matter how much traffic.  Clear, well-defined resource
requirements with well-defined protection between resources.

Shared fate in the switching fabric won't be as stable until routers
(the switching fabric) can allocate and manage resources in a clear and
defined way.  If the resources are being overcommitted, the fabric must
be able to handle the full burden of resource requests while still
managing to provide appropriate resource limits to services.  QoS plays
a part in managing the resources of a given link, but what manages the
resources a service can consume in the fabric itself (CPU, memory,
bandwidth)?  With proper traffic engineering you can build/overbuild
the network to handle 'normal' traffic with a great deal of
reliability.  The switch fabric and/or network itself must be able to
protect itself from the abnormal: limiting the memory/CPU consumption of
a flapping BGP peer so you still have enough resources to handle the
AToM traffic, which is given a higher priority.  Let the BGP peers fail,
let the Internet traffic drop, to save the high-priority traffic and the
MPLS glue traffic that keeps the core operational.

Wouldn't it be great if routers had the equivalent of 'User Mode Linux',
each process handling a service, isolated and protected from each
other?  The physical router would be nothing more than a generic kernel
handling resource allocation.  Each virtual router would have access to
x amount of resources and would either halt, sleep, or crash when it
exhausts those resources for a given time slice.  I don't know of any
method in the current router offerings to limit a VRF to x% of CPU and
y% of memory.

-Matt


> Is this an accurate characterization of your point? If
> so, why should sharing fate in the switching fabric
> necessarily reduce the resiliency of those services
> that share that fabric (i.e., why should this be so)? I
> have some ideas, but I'm interested in what ideas other
> folks have.
> Thanks,
>
> Dave





RE: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Sean Crandall

From Jared:

>   I keep hear of Frame-Relay and ATM signaling that is 
> going to happen in large providers MPLS cores.  That's right, 
> your "safe" TDM based services, will be transported over 
> someones IP backbone first.
> This means if they don't protect their IP network, the TDM 
> services could fail.  These types of CES services are not 
> just limited to Frame and ATM.
> (Did anyone with frame/atm/vpn services from Level3 
> experience the same outage?)

We use Level3 for IP transit and transport (both DS-3 and Ethernet over
MPLS (via Martini)) all over the country.  As with everyone else, we saw
the problems with the transit traffic out of SJC and ATL.  However, our
transport services were not affected at all by the problems.  In fact, I
just ended up sending my Level3-SJC bound traffic to LAX via Level3
which was going through the same equipment as the transit traffic which
was having problems.

From Pete:

> From this, it can be deduced that reducing unnecessary
> system complexity and shortening the strings of pearls that
> make up the system contribute to better availability and
> resiliency of the system. Diversity works both ways in this
> equation. It lessens the probability of the same failure hitting
> the majority of your boxes but at the same time increases the
> knowledge needed to understand and maintain the whole system.
> 
> I would vote for the KISS principle if in doubt.

I agree.  Granted, the string of pearls is always going to be pretty
long, but there is definitely a trend, from what I have seen with
customers, to make the string longer than it needs to be.
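The "string of pearls" argument is the series-availability rule: if every element must work, end-to-end availability is the product of the element availabilities. A short Python sketch, assuming independent failures and an illustrative figure of 99.9% for every element:

```python
from math import prod


def series_availability(components):
    """End-to-end availability of a chain in which every element must be up.

    Assumes the elements fail independently of each other.
    """
    return prod(components)


# Illustrative figures: every "pearl" (router, link, line card...) is 99.9% available.
short_chain = series_availability([0.999] * 3)
long_chain = series_availability([0.999] * 8)

print(f"3 pearls: {short_chain:.5f}  8 pearls: {long_chain:.5f}")
```

Every added pearl multiplies in another factor below 1, so a longer chain can only be less available, even when each element is individually reliable.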

-Sean

Sean P. Crandall
VP Engineering Operations
MegaPath Networks Inc.
6691 Owens Drive
Pleasanton, CA  94588
(925) 201-2530 (office)
(925) 201-2550 (fax)






Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Paul Vixie

[EMAIL PROTECTED] (Jared Mauch) writes:

> ...
>   I keep hear of Frame-Relay and ATM signaling that is going
> to happen in large providers MPLS cores.  That's right, your "safe" TDM
> based services, will be transported over someones IP backbone first.

One of my DS3/DS1 vendors recently told me of a plan to use MPLS for part
of the route inside their switching center.  I said "not with my circuits
you won't".  Once they understood that I was willing to take my business
elsewhere or simply do without, they decided that an M13 was worth having
after all.  My advice is, walk softly but carry a big stick.  When we all
say "everything over IP" that means teaching more devices how to speak
802.11 or other packet-based access protocols rather than giving them ATM
or F/R or dialup modem circuitry.  It does *not* mean simulating an ISO-L1
or ISO-L2 "circuit" using an ISO-L3 "network".  (Ick.)
-- 
Paul Vixie


Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Jeff S Wheeler

On Wed, 2004-02-25 at 13:34, David Meyer wrote:
> Is it that sharing fate in the switching fabric (as
> opposed to say, in the transport fabric, or even
> conduit) reduces the resiliency of a given service (in
> this case FR/ATM/TDM), and as such poses the "danger"
> you describe?

Our vendors will tell us that the IP routing fabrics of today are indeed
quite reliable and resistant to failure, and they may be right when it
comes to hardware MTBF.  However, the IP network relies a great deal
more on shared/inter-domain, real-time configuration (BGP) than do any
traditional telecommunications networks utilizing the tried and true
technologies referenced above.

Yesterday we witnessed a large scale failure that has yet to be
attributed to configuration, software, or hardware; however one need
look no further than the 168.0.0.0/6 thread, or the GBLX customer who
leaked several tens of thousands of their peers' routes to GBLX shortly
before the Level(3) event, to show that configuration-induced failures
in the Internet reach much further than in traditional TDM or single
vendor PVC networks.

The single point of failure we all share is our reliance on a correct
BGP table, populated by our peers and transit providers; and kept free
of errors by those same operators.

-- JSW




RE: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Bora Akyol



> I think it has been proven a few times that physical fate sharing is
> only a minor contributor to the total connectivity availability while
> system complexity, mostly controlled by software written and operated by
> imperfect humans, contributes a major share to end-to-end availability.
>
> From this, it can be deduced that reducing unnecessary system
> complexity and shortening the strings of pearls that make up the system
> contribute to better availability and resiliency of the system. Diversity
> works both ways in this equation. It lessens the probability of the same
> failure hitting the majority of your boxes but at the same time increases
> the knowledge needed to understand and maintain the whole system.
>
> I would vote for the KISS principle if in doubt.

Hi Pete

This train of thought works well only for accidental failures;
unfortunately, if you have an adversary that is bent on disrupting
communications and damaging the critical infrastructure of a country,
physical fate sharing makes things less robust than they need to be. By
the way, no disagreement from me on any of the points you make. Keeping
it simple and robust is definitely a good first step. Having diverse
paths in the fiber infrastructure is also necessary.

Regards, 

Bora




Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Matthew Crocker

> Yesterday we witnessed a large scale failure that has yet to be
> attributed to configuration, software, or hardware; however one need
> look no further than the 168.0.0.0/6 thread, or the GBLX customer who
> leaked several tens of thousands of their peers' routes to GBLX shortly

This should be rewritten 'Or GBLX, who LET one of their customers leak
several tens of thousands of their peers' routes...'.  I'm sorry, a
network should be able to protect itself from its users and customers.
BGP filters are not that hard to figure out, and peer prefix limits
should be part of every config.  Don't trust the guy at the other end
of the pipe to do the right thing.
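The two protections named above can be sketched as a hypothetical Python model of a customer session: an inbound filter that only accepts routes inside the customer's registered space, and a maximum-prefix limit that tears the session down when exceeded. The class, the 198.51.100.0/24 documentation prefix and the limit value are assumptions for illustration, not any router's configuration syntax.

```python
import ipaddress


class CustomerSession:
    """Hypothetical model of an eBGP session to a customer.

    Inbound routes must fall inside the customer's registered space (the
    filter), and the session is torn down if the customer sends more routes
    than agreed (the maximum-prefix limit).
    """

    def __init__(self, allowed, max_prefixes):
        self.allowed = [ipaddress.ip_network(p) for p in allowed]
        self.max_prefixes = max_prefixes
        self.accepted = set()
        self.up = True

    def receive(self, prefix):
        """Process one announcement; return True if the route was accepted."""
        if not self.up:
            return False
        net = ipaddress.ip_network(prefix)
        # Filter: silently drop anything outside the customer's own space.
        if not any(net.subnet_of(block) for block in self.allowed):
            return False
        self.accepted.add(net)
        # Limit: too many routes suggests a leak, so shut the session down
        # and withdraw everything learned from it.
        if len(self.accepted) > self.max_prefixes:
            self.up = False
            self.accepted.clear()
            return False
        return True


session = CustomerSession(["198.51.100.0/24"], max_prefixes=2)
print(session.receive("198.51.100.0/24"))    # True: inside the customer's block
print(session.receive("168.0.0.0/6"))        # False: filtered
print(session.receive("198.51.100.0/25"))    # True: second route, still within limit
print(session.receive("198.51.100.128/25"))  # False: limit exceeded, session torn down
```

The filter alone would stop a leak of other networks' routes; the limit is the backstop for the case where the filter is missing or too loose.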

-Matt



RE: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Erik Haagsman

On Wed, 2004-02-25 at 20:16, Bora Akyol wrote:
> This train of thought works well only for accidental failures;
> unfortunately, if you have an adversary that is bent on disrupting
> communications and damaging the critical infrastructure of a country,
> physical fate sharing makes things less robust than they need to be. By
> the way, no disagreement from me on any of the points you make. Keeping
> it simple and robust is definitely a good first step. Having diverse
> paths in the fiber infrastructure is also necessary.

I don't think fate sharing prevents us from having diverse paths, since
this is where redundancy comes in. Even if all services run over the
same fibre paths, there isn't any problem as long as there's a
sufficient number of alternative paths in case any of the paths goes
down.
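Redundancy maps onto the parallel-availability rule: with independent paths, the service fails only if every path fails. The Python sketch below uses illustrative availability figures. It models a shared conduit as one element in series with the paths, which is where the independence assumption breaks down and where the adversary concern raised above bites.

```python
from math import prod


def parallel_availability(paths):
    """Availability when the service survives as long as at least one path is up.

    Assumes path failures are independent of each other.
    """
    return 1 - prod(1 - a for a in paths)


# Two fully diverse paths, each (illustratively) 99% available.
diverse = parallel_availability([0.99, 0.99])

# Two paths through the same conduit: the conduit (assumed 99%) sits in series
# with the pair, so the pair can never beat the conduit's own availability.
shared_conduit = 0.99 * parallel_availability([0.999, 0.999])

print(f"diverse: {diverse:.6f}  shared conduit: {shared_conduit:.6f}")
```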

Cheers,

-- 
---
Erik Haagsman
Network Architect
We Dare BV
tel: +31.10.7507008
fax: +31.10.7507005
http://www.we-dare.nl






Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread David Meyer

Jared,

>>  I keep hear of Frame-Relay and ATM signaling that is going
>> to happen in large providers MPLS cores.  That's right, your "safe" TDM
>> based services, will be transported over someones IP backbone first.
>> This means if they don't protect their IP network, the TDM services could
>> fail.  These types of CES services are not just limited to Frame and ATM.
>> (Did anyone with frame/atm/vpn services from Level3 experience the
>> same outage?)

Is your concern that carrying FR/ATM/TDM over a packet
core (IP or MPLS or ..) will, via some mechanism, reduce
the resilience of those services, of the packet core,
of both, or something else?

>>  We're at (or already past) the dangerous point of network
>> convergence.  While I suspect that nobody directly died as a result of
>> the recent outage, the trend to link together hospitals, doctors
>> and other agencies via the Internet and a series of VPN clients continues
>> to grow.  (I say this knowing how important the internet is to
>> the medical community, reading x-rays and other data scans at
>> home for the oncall is quite common). 

Again, I'm unclear as to what constitutes "the dangerous
point of network convergence", or for that matter, what
constitutes convergence (I'm sure we have close to a
common understanding, but it's worth making that
explicit).  In any event, can you be more explicit about
what you mean here?

Thanks,

Dave





Re: T1 Customer CPE Replacement?

2004-02-25 Thread Curtis Maurand


They're still in business.  They've been bought out, but they're still 
there.  

Curtis

On Tue, 24 Feb 2004, Sameer Khosla wrote:

> Just to add my 2 cents, I have installed a lot of Openroute routers over the
> years, and have had virtually no problems with them.  There is a GTX 1000
> Model that is modular, for which there are T1 CSU modules available.  I
> believe there were DSL modules as well.  There was also a GTX1500 model
> which had the encryption hardware for VPN's.
> 
> If anyone is interested, drop me a line and I'll do some digging with one of
> their former SE's.  He has access to a fair supply of them.
> 
> Sameer
> 
> - Original Message -
> From: "Curtis Maurand" <[EMAIL PROTECTED]>
> To: "Brian Bruns" <[EMAIL PROTECTED]>
> Cc: "Claydon, Tom" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> Sent: Tuesday, February 24, 2004 1:35 PM
> Subject: Re: T1 Customer CPE Replacement?
> 
> 
> >
> >
> > I had excellent luck with OpenRoute (formerly Proteon) GT90's.  They
> > handle dual ethernet or T1/E1.  They need an external CSU/DSU, but they
> > get the job done and they're very stable.  They will do NAT and most of
> > the other goodies that you can think about.  They also have gt900 firewall
> > and they have a model with a hardware accelerator to handle cryptographic
> > calculations.  I installed one of the latter into a customer about 4 years
> > ago and I've not had any trouble with it, except to upgrade the OS once to
> > fix some weird packet length issues surrounding IPSEC tunnels.
> >
> > http://www.openroute.com
> >
> > Curtis
> >
> >
> > On Mon, 23 Feb 2004, Brian Bruns wrote:
> >
> > >
> > > On Monday, February 23, 2004 3:37 PM [EST], Claydon, Tom
> > > <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hello,
> > > >
> > > > We're looking for a good replacement for fractional T1 customers with
> > > > Cisco 1600- & 1700-series routers as their CPE. They are good routers,
> > > > but the ongoing support costs are an issue, and we need to replace them
> > > > ASAP.
> > > >
> > > > Someone had mentioned several CPE vendors, such as Adtran and Netopia.
> > > > Are there any others, and does anyone have any pros/cons of what
> > > > they're familiar with?
> > > >
> > > >
> > >
> > > I'm quite familiar with the Netopia R53xx series T1 routers.  Excellent
> > > little routers for deploying to customers.  Very reliable, and if you are
> > > familiar with the DSL routers, you'll be right at home.  They have built-in
> > > PPTP/ATMP/IPSec VPN support (both client and server), basic routing
> > > features, filtering, NAT, one-to-one IP mapping, remote syslog logging, as
> > > well as everything you'd expect in a T1 router (fractional T1 support,
> > > HDLC, PPP, FrameRelay, etc).  There's also a 56k dialup backup module
> > > which is handy.
> > >
> > >
> >
> > --
> > --
> > Curtis Maurand
> > mailto:[EMAIL PROTECTED]
> > http://www.maurand.com
> >
> >
> 

-- 
--
Curtis Maurand
mailto:[EMAIL PROTECTED]
http://www.maurand.com




Sprint midwest/NW backbone issues?

2004-02-25 Thread Dave O'Shea

Anyone at Sprint care to shed some light on a latency
issue that's been going on since December?

(While the SLA credits are always nice, there's
something to be said for actually getting the traffic
from A to B!)

sl-bb20-fw>ping 144.228.241.75

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 144.228.241.75, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 84/115/172 ms
sl-bb20-fw>

... and Seattle to Dallas:
sl-bb20-sea>ping 144.228.241.51

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 144.228.241.51, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 76/76/80 ms
sl-bb20-sea>





Re: New Draft Document: De-boganising New Address Blocks

2004-02-25 Thread Randy Bush

[ nick has trouble posting, so ... ]

Date: Wed, 25 Feb 2004 00:27:00 -0500
From: Nick Feamster <[EMAIL PROTECTED]>
Subject: Re: New Draft Document: De-boganising New Address Blocks
To: [EMAIL PROTECTED]
User-Agent: Mutt/1.4.1i

On Tue, Feb 24, 2004 at 06:28:48PM +0100, Daniel Karrenberg wrote:
> > Why can't ISPs subscribe to a feed of all new 
> > RIPE allocations in near real-time?
> 
> Personally I think this is a great idea and if we hear from a lot of
> operators actually willing to take such feeds it may become reality
> beyond volunteer efforts like the Team CYMRU one.  However there are a
> number of serious issues with something like this, not the least of
> which are the liability issues in case this goes wrong very dynamically
> and semi-automatedly. 
> 
> It is certainly something to progress if there is enough interest.
> 
> However I think the current proposal should go ahead too because the false
> positives are a real problem that needs to be addressed quickly.
> 

fyi, I have written a configuration checking tool that checks for
a configuration's conformance to the Cymru bogon list.

See:
http://nms.lcs.mit.edu/bgp/rolex/

for more information.  The tool also checks for various other errors
(summarized at http://nms.lcs.mit.edu/bgp/rolex/tests.html)

I also have a writeup that describes the tool in further detail, as well
as an empirical evaluation that supports these observations about bogon
filtering practices (based on results of running the tool on several
ASes).  Let me know if you'd either (1) like a copy of this writeup or
(2) want to help me generate more empirical data (i.e., want to run the
tool on your configs, let me do so, etc.)

Cheers,
-Nick



How reliable does the Internet need to be? (Was: Re: Converged Network Threat)

2004-02-25 Thread Steve Gibbard

Having woken up this morning and realized it was raining in my bedroom
(last night was the biggest storm the Bay Area has had since my house got
its new roof last summer), and then having moved from cleaning up that
mess to vacuuming water out of the basement after the city's storm sewer
overflowed (which seems to happen to everybody in my neighborhood a couple
of times a year), I've spent lots of time today thinking about general
expectations of reliability.  Since the telecommunications industry tends
to treat reliability as very important and any outage as a disaster, I
hope the questions I've been coming up with aren't career-ending. ;)
With that in mind, how much in the way of reliability problems is it
reasonable to expect our users to accept?

If the Internet is a utility, or more generally infrastructure our society
depends on, it seems there are a bunch of different systems to compare it
to.  In general, if I pick up my landline phone, I expect to get a
dialtone, and I expect to be able to make a call.  If somebody calls my
landline, I expect the phone to ring, and if I'm near the phone I expect
to be able to answer.  Yet, if I want somebody to actually get through to
me reliably, I'll probably give them my cell phone number instead.  If it
rings, I'm far more likely to be able to answer it easily than I am my
landline, since the landline phone is in a fixed location.  Yet some
significant portion of calls to or from my cell phone come in when I'm in
areas with bad reception, and the conversation becomes barely
understandable.  In many cases, the signal is too weak to make a call at
all, and those who call me get sent straight to voicemail.  Most of us put
up with this, because we judge mobility to be more important than
reliability.

I don't think I've ever had a natural gas outage that I've noticed, but
most of my gas appliances won't work without electric power.  I seem to
lose electric power at home for a few hours once a year or so, and after
the interruption, life tends to resume as it was before.  When power outages
were significantly more frequent, and due to rationing rather than to
accidents, they caused major political problems for the California
government.  There must be some threshold for what people are willing to
accept in terms of residential power outages, and it's somewhere above 2-3
hours per year.

In Ann Arbor, Michigan, where I grew up, the whole town tended to pretty
much grind to a halt two or three days a year, when more snow fell than
the city had the resources to deal with.  The quantity of snow necessary
to cause that was probably four or five inches.  My understanding is that
Minneapolis and Washington DC both grind to a halt due to snow with
somewhat similar frequency, but the amount of snow required is
significantly more in Minneapolis and significantly less in DC.  Again,
there must be some threshold of interruptions due to exceptionally bad
weather that is tolerated: a level nobody wants to do worse than, and
nobody wants to spend the money to do better than.

So, it appears that among general infrastructure we depend on, there are
probably the following reliability thresholds:

Employees not being able to get to work due to snow: two to three days per
year.
Berkeley storm sewers: overflow two to three days per year.
Residential Electricity: out two to three hours per year.
Cell phone service: Somewhat better than nine fives of reliability ;)
Landline phone service:  I haven't noticed an outage on my home lines in a
few years.
Natural gas: I've never noticed an outage.
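To put the list above on a common scale, availability and annual downtime convert into each other directly. A minimal sketch, assuming a 365-day (8,760-hour) year; the helper names are illustrative only. It also shows why the tongue-in-cheek "nine fives" (55.5555555% availability) works out to roughly 162 days of downtime a year, while two to three hours of power outage a year is a bit better than 99.96% availability.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years ignored


def repeated_digit(count, digit):
    """Build 0.ddd...d, e.g. repeated_digit(5, 9) gives 0.99999 ("five nines")."""
    return float("0." + str(digit) * count)


def downtime_hours(availability):
    """Expected hours of outage per year at a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def availability_from_downtime(hours_per_year):
    """Availability implied by a given number of outage hours per year."""
    return 1.0 - hours_per_year / HOURS_PER_YEAR


if __name__ == "__main__":
    nine_fives = repeated_digit(9, 5)        # 0.555555555
    five_nines = repeated_digit(5, 9)        # 0.99999
    print(f"nine fives: {downtime_hours(nine_fives):.0f} h/yr")
    print(f"five nines: {downtime_hours(five_nines) * 60:.1f} min/yr")
    print(f"3 h/yr of power outages: {availability_from_downtime(3):.5%}")
```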

How Internet service fits into that of course depends on how you're
accessing the Net.  The T-Mobile GPRS card I got recently seems
significantly less reliable than my cell phone.  My SBC DSL line is almost
at the reliability level of my landline phone or natural gas service,
except that the DSL router in my basement doesn't work when electric power
is out.  I'm probably poorly qualified to talk about the end-user
experience on the networks I actually work on, even if I had permission
to.  Like pretty much everybody else here, I'm always interested in doing
better on reliability.  And, like many of my neighbors, I'd like to be
able to store stuff on my basement floor.  In comparison to a lot of other
infrastructure we depend on, it seems to me the Internet is already doing
pretty well.

-Steve

On Wed, 25 Feb 2004, Jared Mauch wrote:

>
>   Ok.
>
>   I can't sit by here while people speculate about the possible
> problems of a network outage.
>
>   I think that most everyone here reading NANOG realizes that
> the Internet is becoming more and more central to daily life even
> for those that are not connected to the internet.
>
>   From where i'm sitting, I see a number of potentially dangerous
> trends that could result in some quite catastrophic failures of networks.
> No, i'm not predicting that the internet will end in 8^H7 days or anything
> like that.  I think the Level3 outage 

Re: How reliable does the Internet need to be? (Was: Re: Converged Network Threat)

2004-02-25 Thread W.D.McKinney


>-Original Message-
>From: Steve Gibbard [mailto:[EMAIL PROTECTED]
>Sent: Thursday, February 26, 2004 12:30 AM
>To: [EMAIL PROTECTED]
>Subject: How reliable does the Internet need to be? (Was: Re: Converged Network Threat)



>So, it appears that among general infrastructure we depend on, there are
>probably the following reliability thresholds:
>
>Employees not being able to get to work due to snow: two to three days per
>year.
>Berkeley storm sewers: overflow two to three days per year.
>Residential Electricity: out two to three hours per year.
>Cell phone service: Somewhat better than nine fives of reliability ;)
>Landline phone service:  I haven't noticed an outage on my home lines in a
>few years.
>Natural gas: I've never noticed an outage.
>
>How Internet service fits into that of course depends on how you're
>accessing the Net.  The T-Mobile GPRS card I got recently seems
>significantly less reliable than my cell phone.  My SBC DSL line is almost
>to the reliability level of my landline phone or natural gas service,
>except that the DSL router in my basement doesn't work when electric power
>is out.  I'm probably poorly qualified to talk about the end-user
>experience on the networks I actually work on, even if I had permission
>to.  Like pretty much everybody else here, I'm always interested in doing
>better on reliability.  And, like many of my neighbors, I'd like to be
>able to store stuff on my basement floor.  In comparison to a lot of other
>infrastructure we depend on, it seems to me the Internet is already doing
>pretty well.
>
>-Steve
>
>

With BPL on the horizon and the electric utilities looking to deregulate in some areas,
it will be interesting to watch infrastructure adapt accordingly.
I think the Internet is doing pretty well, save some IOS code problems from time to
time, and the typical root server hiccups.

Dee
 








RE: How reliable does the Internet need to be? (Was: Re: Converged Network Threat)

2004-02-25 Thread Bora Akyol

It needs to be as reliable as the services that depend on it.

E.g. if bank A is using the Internet exclusively without
leased line back up to run its ATMs, or to interface with
its customers, then it needs to be VERY reliable.

If it's just my kid checking his email on AOL, probably
not that reliable.

As more and more critical services/infrastructure move
to IP/MPLS, the expectations in terms of reliability
go up every year. The real questions are:

* How much are customers willing to pay for it?
* What kind of reporting/management infrastructure do we have
to enforce/monitor the reliability commitment in the SLA?

The discussion today about FR/ATM running over an MPLS core
was very interesting since bank A may in fact think they have
a backup FR circuit but they may not know that their FR circuit is
in fact running over the same IP/MPLS core. Surprise, surprise :-)
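One way to make the point concrete: if you can get a path inventory from your providers (often the hard part), checking whether a "backup" actually shares elements with the primary is a simple set intersection. The sketch below is hypothetical throughout; the function name and element names are made up and stand for whatever granularity of inventory (switches, routers, conduits) is actually available.

```python
def shared_fate(primary_path, backup_path):
    """Return the network elements that appear in both paths, sorted.

    An empty result means no shared element at the granularity of the
    inventory supplied; it says nothing about layers the inventory omits.
    """
    return sorted(set(primary_path) & set(backup_path))


if __name__ == "__main__":
    # Hypothetical inventories: an Internet VPN and a "backup" frame relay
    # circuit that the carrier happens to transport over its IP/MPLS core.
    vpn_primary = ["cpe-bank-a", "pe-chi-1", "p-core-3", "pe-nyc-2"]
    fr_backup = ["fr-sw-chi", "pe-chi-1", "p-core-3", "fr-sw-nyc"]
    common = shared_fate(vpn_primary, fr_backup)
    if common:
        print("backup shares fate with primary at:", ", ".join(common))
    else:
        print("no shared elements found in the supplied inventory")
```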

Bora

ps. I am located about 100 miles south of SF and I was very happy
that my cable modem service was up all day :-)



Re: How reliable does the Internet need to be? (Was: Re: Converged Network Threat)

2004-02-25 Thread Joe Abley


On 26 Feb 2004, at 08:46, W.D.McKinney wrote:

> I think the Internet is doing pretty well save some IOS code problems
> from time to time, and the typical root server hiccups.

I'm interested to know what you mean by "typical root server hiccups".
I'm trying to think of an incident which left the Internet generally 
unable to receive answers to queries on the root zone, but I can't 
think of one.

By "typical", do you mean "non-existent"?

Joe



Re: How reliable does the Internet need to be? (Was: Re: Converged Network Threat)

2004-02-25 Thread W.D.McKinney

Thanks for pointing that out. That was the wrong way to describe my standpoint. 
Frequent changes in DNS across the board, including on edge servers,
make connections seem down when in reality the cause is a misconfigured DNS zone. So 
whether 

Dee


>-Original Message-
>From: Joe Abley [mailto:[EMAIL PROTECTED]
>Sent: Thursday, February 26, 2004 12:57 AM
>To: 'W.D.McKinney'
>Cc: [EMAIL PROTECTED]
>Subject: Re: How reliable does the Internet need to be? (Was: Re: Converged Network Threat)
>
>
>
>On 26 Feb 2004, at 08:46, W.D.McKinney wrote:
>
>> I think the Internet is doing pretty well save some IOS code problems 
>> from time to time, and the typical root server hiccups.
>
>I'm interested to know what you mean by "typical root server hiccups".
>I'm trying to think of an incident which left the Internet generally 
>unable to receive answers to queries on the root zone, but I can't 
>think of one.
>
>By "typical", do you mean "non-existent"?
>
>
>Joe
>
>





Re: How reliable does the Internet need to be? (Was: Re: Converged Network Threat)

2004-02-25 Thread Chris Yarnell

> code problems from time to time, and the typical root server hiccups.

Which hiccups are those?


Re: How reliable does the Internet need to be? (Was: Re: Converged Network Threat)

2004-02-25 Thread joshua sahala
On (25/02/04 16:30), Steve Gibbard wrote:
> 
> With that in mind, how much in the way of reliability problems is it
> reasonable to expect our users to accept?

probably something more than the downtime we tell them to expect, but less
than we would (secretly) hope - most users tend to complain if it becomes
uncomfortable to them and they think that calling might make it better.

> 
> If the Internet is a utility, or more generally infrastructure our society
> depends on, it seems there are a bunch of different systems to compare it
> to. 

don't forget such useful things as (snail) mail and trash collection -
we tend to accept more problems with mail (except around certain
holidays)...but if we want more reliability or responsiveness, we pay
extra (or choose a different carrier).  trash is forgiving only to the
point that it isn't making things uncomfortable, ie the stench isn't
overwhelming the can of air-freshener ;)
while it is true that we accept mobility over reliability on our cell
phones, we are becoming less and less forgiving of this (hence the race
to blanket the country with cell towers).  we compare cell service to
landline service, and we accepted that it would take time to get better
coverage, but now it must work all the time, everywhere...

> 
> There must be some threshold for what people are willing to accept in 
> terms of residential power outages, that's somewhere above 2-3 hours 
> per year.

two or three hours a year would be wonderful here (southern florida),
but the grid is old and very susceptible to lightning (or cars) taking
out a transformer/relay/etc - i agree though, there is a threshold,
which in this case is 'configurable' in the sense that users can be
conditioned to accept worse and worse service.

> 
> So, it appears that among general infrastructure we depend on, there are
> probably the following reliability thresholds:
> 

mail - about twice as long (2-3 day first class taking 5-6), but
dependent upon the importance as perceived by the customer

trash - smell not overpowering, and bins not overflowing too badly,
presence of rats or cockroaches will reduce the threshold though ;)

> 
> How Internet service fits into that of course depends on how you're
> accessing the Net.

based somewhat upon what the customer thinks the reliability should be,
and what they are conditioned to accept - everyone here asks their
friends/coworkers who has the best dsl/cable/email/cell/etc service and
price.  this is also the reason that many of us run our own mail/web/etc
servers, so that we have a better idea of what to expect (if operator
error is going to render my email useless, i want it to be my error...)
this brings up another point, we like to be able to 'blame' the error on
someone/thing...if i hose my server, well then i'm an idiot...if my dsl
provider reloads their transit router, then they are the idiot...if the
driver in front of me is going too slow in rush hour and a semi pulls in
ahead...but i digress.
in the race to put more 9's on the company website we have created, in
some cases, unrealistic expectations.
these expectations have not yet been tempered by time or reality, partly
because we (network operators) have done a pretty good job of running
this internet thing in an almost reliable manner.  when something goes
wrong, we do our best to prevent that from happening again (for at least
the next month or two).
as to the question of how reliable users expect it to be, i
believe that it is a semi-individual thing:  as a user, i expect (or
should i say hope) it to be available when i need/want to use it, but 
as an operator, i can understand how/why it isn't (but i don't always 
like it ;) )  
the internet is as important as the service we run over it...the more
vital (or money-making), the higher the expectation - especially
when it is a service that we already have

my $0.02

/joshua
-- 
Fixing Unix is easier than living with NT.
Jonathan Gilpin




Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread dan

Convergence, and our "lust" to throw TDM/ATM infrastructure in the garbage
is an area very near and dear to my heart.

I apologize if I am being a bit redundant here... but from our
perspective, we are an ISP that is under a lot of pressure to deploy a
VoIP solution.  I just don't think we can... It's just not reliable enough
yet. Period.

In a TDM environment the end node switch is incredibly reliable.  I can't
ever remember in my 30 years on this earth when the end node my telephone
was connected to was EVER down, not once, not EVER.  A circuit switch
environment gives us inherent admission control (if there are not enough
tandem/interswitch trunks we just get a fast busy).  This allows them to
guarantee end to end quality.  The one problem is that if any of the
tandems along the path my call is connected through get nuked off the face
of the earth, I am completely off the air.

In an IP (packet based) environment, theoretically routing protocols can
reroute my call while it is in progress if a catastrophic event occurs,
like the entire NE losing power. The inherent problem with IP is that it
has no admission control, and that its fundamental resilient design was to
make sure that the "core" of the network knew nothing about the flows
within, so that it _could_ survive a failure.  This design goal is the
problem when trying to guarantee end to end quality of service.  Without
admission control, we can pack it full, so that nothing works.
Variable length frames mean that we have little idea of what is coming
down the pipe next.
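The contrast between the two environments can be sketched in a few lines. A trunk group with a fixed number of circuits either admits a call or returns a fast busy, so admitted calls keep their quality; a shared packet link with no admission control accepts every call, and once offered load exceeds capacity every call degrades together. This is a toy model: the class and function names, the link size, and the assumed 80 kbps per voice call are illustrative, not measurements.

```python
class TrunkGroup:
    """Circuit-switched style admission control over a fixed trunk pool."""

    def __init__(self, trunks):
        self.trunks = trunks
        self.in_use = 0

    def setup_call(self):
        """Seize a trunk if one is free; otherwise the caller gets a fast busy."""
        if self.in_use < self.trunks:
            self.in_use += 1
            return "connected"
        return "fast busy"

    def release_call(self):
        """Return a trunk to the pool when a call ends."""
        if self.in_use > 0:
            self.in_use -= 1


def per_call_share(link_kbps, calls, call_kbps):
    """Fraction of each call's bandwidth served on a link with no admission
    control: every call is accepted, and overload is shared by all of them."""
    offered = calls * call_kbps
    if offered <= link_kbps:
        return 1.0
    return link_kbps / offered


if __name__ == "__main__":
    group = TrunkGroup(trunks=2)
    print([group.setup_call() for _ in range(3)])
    # Assumed figures: a 1,000 kbps link and 80 kbps per call.
    for calls in (10, 25):
        print(calls, "calls ->", per_call_share(1000, calls, 80))
```

With two trunks the third call is refused outright, while on the shared link 25 calls each get only half of the bandwidth they need, which for voice likely means none of them is usable.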

This can all be solved by massively overbuilding our network.

Other than the occasional DoS against an area of the network, outages
caused by overuse are relatively rare.

The big problem is the end node hardware in IP networks.  Routers crash
ALL the time; it is actually a joke.  Yes, theoretically a user could
have 3 separate connections to the Internet and use their VoIP phone and
be happy, but that is not the case.  They buy Internet service from one
place, and it is aggregated in the same building as that TDM end node in
the voice world (usually).  That aggregation (access) layer is the single
biggest vulnerability in both worlds.  It just does not fail in the TDM
world like it does in the IP world.  We need to find ways to make that
work better in the IP world so it can be as reliable as the TDM world.  I
realize that we (the public) are asking IP hardware vendors for new
features far faster than they can be released reliably... but surely we
can find ways to fail it over more effectively than it does now...


Dan.







Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Jared Mauch

On Wed, Feb 25, 2004 at 09:44:51AM -0800, David Meyer wrote:
>   Jared,
> 
> >>I keep hearing of Frame-Relay and ATM signaling that is going
> >> to happen in large providers' MPLS cores.  That's right, your "safe" TDM
> >> based services will be transported over someone's IP backbone first.
> >> This means if they don't protect their IP network, the TDM services could
> >> fail.  These types of CES services are not just limited to Frame and ATM.
> >> (Did anyone with frame/atm/vpn services from Level3 experience the
> >> same outage?)
> 
>   Is your concern that carrying FR/ATM/TDM over a packet
>   core (IP or MPLS or ..) will, via some mechanism, reduce
>   the resilience of those services, of the packet core,
>   of both, or something else?

I'm saying that if a network had a FR/ATM/TDM failure in the past
it would be limited to just the FR/ATM/TDM network.  (well, aside from
any IP circuits that are riding that FR/ATM/TDM network).  We're now seeing
the change from the TDM based network being the underlying network to the
"IP/MPLS Core" being this underlying network.

What it means is that a failure of the IP portion of the network
that disrupts the underlying MPLS/GMPLS/whatnot core that is now 
transporting these FR/ATM/TDM services, does pose a risk.  Is the risk
greater than in the past, relying on the TDM/WDM network?  I think that
there could be some more spectacular network failures to come.  Overall
I think people will learn from these to make the resulting networks
more reliable.  (eg: there has been a lot learned as a result of the
NE power outage last year).

> >>We're at (or already past) the dangerous point of network
> >> convergence.  While I suspect that nobody directly died as a result of
> >> the recent outage, the trend to link together hospitals, doctors
> >> and other agencies via the Internet and a series of VPN clients continues
> >> to grow.  (I say this knowing how important the internet is to
> >> the medical community, reading x-rays and other data scans at
> >> home for the oncall is quite common). 
> 
>   Again, I'm unclear as to what constitutes "the dangerous
>   point of network convergence", or for that matter, what
>   constitutes convergence (I'm sure we have close to a
>   common understanding, but it's worth making that
>   explicit).  In any event, can you be more explicit about
>   what you mean here?

Transporting FR/ATM/TDM/Voice over the IP/MPLS core, as well as
some of the technology shifts (VoIP, Voice over Cable, etc..) are removing
some of the resilience from the end-user network that existed in the past.

I think that most companies that offer frame-relay and also
have an IP network are looking at moving their frame-relay on to their IP
network.  (I could clearly be wrong here.)  This means that overall we need
to continue to provide a more reliable IP network than in the past.  It
is critically important.  I think that Pete Templin is right to question
people's statements that "nobody died because of a network outage".  While
I think that is likely true this time, will that still be the case in 2-3
years as Qwest, SBC, Verizon, and others move to a more native VoIP infrastructure?

A failure within their IP network could result in some emergency
calling (eg: 911) not working.  While there are alternate means of calling
for help (cell phone, etc..) that may not rely upon the same network elements
that have failed, some people would consider a 60 second delay as you
switch contact methods too long and an excessive risk to someone's health.

I think it bolsters the case for personal emergency preparedness,
but also for spending more time looking at the services you purchase.  If
you are relying on a private frame-relay circuit as backup for your VPN over
the public internet, knowing if this is switched over an IP network becomes
more important.

(I know this is treading on a few "what if" scenarios, but it could
actually mean a lot if we convert to a mostly IP world as I see the trend).

- jared

-- 
Jared Mauch  | pgp key available via finger from [EMAIL PROTECTED]
clue++;  | http://puck.nether.net/~jared/  My statements are only mine.


Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread David Meyer

Jared,

>> >Is your concern that carrying FR/ATM/TDM over a packet
>> >core (IP or MPLS or ..) will, via some mechanism, reduce
>> >the resilience of those services, of the packet core,
>> >of both, or something else?
>> 
>>  I'm saying that if a network had a FR/ATM/TDM failure in
>> the past it would be limited to just the FR/ATM/TDM network.
>> (well, aside from any IP circuits that are riding that FR/ATM/TDM
>> network).  We're now seeing the change from the TDM based
>> network being the underlying network to the "IP/MPLS Core"
>> being this underlying network. 
>> 
>>  What it means is that a failure of the IP portion of the network
>> that disrupts the underlying MPLS/GMPLS/whatnot core that is now 
>> transporting these FR/ATM/TDM services, does pose a risk.  Is the risk
>> greater than in the past, relying on the TDM/WDM network?  I think that
>> there could be some more spectacular network failures to come.  Overall
>> I think people will learn from these to make the resulting networks
>> more reliable.  (eg: there has been a lot learned as a result of the
>> NE power outage last year).

I think folks can almost certainly agree that when you
share fate, well, you share fate. But maybe there is
something else here. Many of these services have always
shared fate at the transport level; that is, in most
cases, I didn't have a separate fiber plant/DWDM
infrastructure for FR/ATM/TDM, IP, Service X, etc., so
fate was already being/has always been shared in the
transport infrastructure. 

So maybe try this question: 

  Is it that sharing fate in the switching fabric (as
  opposed to say, in the transport fabric, or even
  conduit) reduces the resiliency of a given service (in
  this case FR/ATM/TDM), and as such poses the "danger"
  you describe?

Is this an accurate characterization of your point? If
so, why should sharing fate in the switching fabric
necessarily reduce the resiliency of those services
that share that fabric (i.e., why should this be so)? I
have some ideas, but I'm interested in what ideas other
folks have.   

Thanks,

Dave




Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Petri Helenius
David Meyer wrote:

> Is this an accurate characterization of your point? If
> so, why should sharing fate in the switching fabric
> necessarily reduce the resiliency of those services
> that share that fabric (i.e., why should this be so)? I
> have some ideas, but I'm interested in what ideas other
> folks have.

I think it has been proven a few times that physical fate sharing is
only a minor contributor to total connectivity availability, while
system complexity, mostly controlled by software written and operated by
imperfect humans, is a major factor in end-to-end availability.

From this, it can be deduced that reducing unnecessary system
complexity and shortening the strings of pearls that make up the system
contributes to better availability and resiliency of the system. Diversity
works both ways in this equation. It lessens the probability of the same
failure hitting the majority of your boxes, but at the same time increases
the knowledge needed to understand and maintain the whole system.
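The "string of pearls" argument follows from how availabilities combine under the textbook assumption that failures are independent: elements in series multiply, so every extra box in the chain lowers the total, while redundant elements in parallel fail only if all of them fail. The sketch below is a minimal illustration with made-up per-element figures and illustrative function names (`math.prod` needs Python 3.8+). The independence assumption is exactly what a shared software bug breaks, which is the other side of the diversity trade-off.

```python
from math import prod


def series(*availabilities):
    """All elements must work: availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities):
    """Any one element suffices: the group fails only if every element fails.
    Assumes failures are independent, which common software may not be."""
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    box = 0.999  # hypothetical per-box availability
    for length in (5, 10, 20):
        chain = series(*[box] * length)
        print(f"{length:2d} boxes in series: {chain:.4f}")
    print(f"two 0.99 paths in parallel: {parallel(0.99, 0.99):.4f}")
```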

I would vote for the KISS principle if in doubt.

Pete



Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Jared Mauch

On Wed, Feb 25, 2004 at 10:34:55AM -0800, David Meyer wrote:
>   Jared,
> 
> >> >  Is your concern that carrying FR/ATM/TDM over a packet
> >> >  core (IP or MPLS or ..) will, via some mechanism, reduce
> >> >  the resilience of those services, of the packet core,
> >> >  of both, or something else?
> >> 
> >>I'm saying that if a network had a FR/ATM/TDM failure in
> >> the past it would be limited to just the FR/ATM/TDM network.
> >> (well, aside from any IP circuits that are riding that FR/ATM/TDM
> >> network).  We're now seeing the change from the TDM based
> >> network being the underlying network to the "IP/MPLS Core"
> >> being this underlying network. 
> >> 
> >>What it means is that a failure of the IP portion of the network
> >> that disrupts the underlying MPLS/GMPLS/whatnot core that is now 
> >> transporting these FR/ATM/TDM services, does pose a risk.  Is the risk
> >> greater than in the past, relying on the TDM/WDM network?  I think that
> >> there could be some more spectacular network failures to come.  Overall
> >> I think people will learn from these to make the resulting networks
> >> more reliable.  (eg: there has been a lot learned as a result of the
> >> NE power outage last year).
> 
>   I think folks can almost certainly agree that when you
>   share fate, well, you share fate. But maybe there is
>   something else here. Many of these services have always
>   shared fate at the transport level; that is, in most
>   cases, I didn't have a separate fiber plant/DWDM
>   infrastructure for FR/ATM/TDM, IP, Service X, etc,  so
>   fate was already being/has always been shared in the
>   transport infrastructure. 
> 
>   So maybe try this question: 
> 
> Is it that sharing fate in the switching fabric (as
> opposed to say, in the transport fabric, or even
> conduit) reduces the resiliency of a given service (in
> this case FR/ATM/TDM), and as such poses the "danger"
> you describe?

I think the threat is that the switching fabric and
forwarding plane can be disrupted by more things than exist in a 
pure TDM based network.  This isn't to say that the packet (or even
label) network isn't the "future" of these services, it's just
that today there are some interesting problems that still exist as
the technology continues to mature.

>   Is this an accurate characterization of your point? If
>   so, why should sharing fate in the switching fabric
>   necessarily reduce the resiliency of those services
>   that share that fabric (i.e., why should this be so)? I
>   have some ideas, but I'm interested in what ideas other
>   folks have.   

I believe that there still exist a number of cases where the
switching fabric can get out-of-sync with the control-plane.

If events are not properly triggered back upstream (ie: adjacencies
stay up, BGP remains fairly stable) and you end up dumping a lot of
traffic on the floor, it's sometimes a bit more difficult to diagnose
than loss of light on a physical path.

On the sunny side, I see this improving over time.  Software
bugs will be squashed.  Poorly designed networks will be reconfigured to
better handle these situations.

- jared

-- 
Jared Mauch  | pgp key available via finger from [EMAIL PROTECTED]
clue++;  | http://puck.nether.net/~jared/  My statements are only mine.


Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Petri Helenius
Jared Mauch wrote:

> On the sunny side, I see this improving over time.  Software
> bugs will be squashed.  Poorly designed networks will be reconfigured to
> better handle these situations.

The trend running against these points is the added features and
complexity in the software due to market requirements. So while the
box you got two years ago might have fewer bugs today, there are more
attractive new devices with new bugs in both the old and new features.
People seem to be quite convinced that if you put more features into a
box, people will pay more for it.

On your second point, it seems that most network protocols are
converging towards TCP port 80. So unless network performance and
availability degrade really badly, most users are indifferent, and the
1st level helpdesk at their provider tells them that "at times the internet
might be slow", and they are usually quite happy and understanding with
that answer because they don't know that it could be better.

So outside the Fortune 500 and some clueful individuals, where is the
market for a well-designed, bug-free "Internet"?

Pete



Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread David Meyer

Petri,

>> I think it has been proven a few times that physical fate sharing is 
>> only a minor contributor to the total connectivity availability while 
>> system complexity mostly controlled by software written and operated by 
>> imperfect humans contribute a major share to end-to-end availability.

Yes, and at the very least would seem to match our
intuition and experience. 

>> From this, it can be deduced that reducing unnecessary system
>> complexity and shortening the strings of pearls that make up the system
>> contribute to better availability and resiliency of the system. Diversity
>> works both ways in this equation. It lessens the probability of same
>> failure hitting majority of your boxes but at the same time increases 
>> the knowledge needed to understand and maintain the whole system.

No doubt. However, the problem is: What constitutes
"unnecessary system complexity"? A designed system's
robustness comes in part from its complexity. So it's not
that complexity is inherently bad; rather, it is just
that you wind up with extreme sensitivity to outlying
events which is exhibited by catastrophic cascading
failures if you push a system's complexity past some
point; these are the so-called "robust yet fragile"
systems (think NE power outage).  

BTW, the extreme sensitivity to outlying events/catastrophic
cascading failures property is a signature of a class of
dynamic systems of which we believe the Internet is an
example; unfortunately, the machinery we currently have
(in dynamical systems theory) isn't yet mature enough to
provide us with engineering rules.

>> I would vote for the KISS principle if in doubt.

Truly. See RFC 3439 and/or
http://www.1-4-5.net/~dmm/complexity_and_the_internet. I
also said a few words about this topic at NANOG26
where we had a panel on this topic (my slides are at
http://www.maoz.com/~dmm/NANOG26/complexity_panel).

Dave




Re: Converged Networks Threat (Was: Level3 Outage)

2004-02-25 Thread Petri Helenius
David Meyer wrote:

	
> No doubt. However, the problem is: What constitutes
> "unnecessary system complexity"? A designed system's
> robustness comes in part from its complexity. So it's not
> that complexity is inherently bad; rather, it is just
> that you wind up with extreme sensitivity to outlying
> events which is exhibited by catastrophic cascading
> failures if you push a system's complexity past some
> point; these are the so-called "robust yet fragile"
> systems (think NE power outage).

I think you hit the nail on the head. I view complexity as a diminishing
returns play. When you increase complexity, the increase benefits a
decreasing percentage of the users. A way to manage complexity is
splitting large systems into smaller pieces and trying to make the pieces
independent enough to survive the failure of a neighboring piece. This
approach exists at least in the marketing materials of many
telecommunications equipment vendors. The question then becomes, "what
good is a backbone router without a BGP process?" So far I haven't seen a
router with disposable entities on a per-interface or per-peer basis, such
that if the BGP session with 10.1.1.1 crashes, the system would still be
able to maintain its relationship with 10.2.2.2. Obviously the point of
single-device availability becomes moot if we can figure out a way to
route/switch around the failed device quickly enough. Today we don't even
have a generic IP-layer liveness protocol, so by default packets will be
blackholed for a definite duration until a routing protocol starts to
miss its hello packets. (I'm aware of work towards this goal.)
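A generic liveness check of the kind described above boils down to a per-neighbor timer: each neighbor sends hellos at a fixed interval and is declared down once a set number of intervals pass with nothing heard, so detection time is interval times multiplier. The sketch below is a hypothetical illustration, not any specific protocol, and its class name and timer values are assumptions. Keeping state per neighbor mirrors the per-peer independence argued for above, since a silent 10.1.1.1 says nothing about 10.2.2.2. Timestamps are passed in explicitly to keep it deterministic.

```python
class LivenessMonitor:
    """Per-neighbor hello tracking with a fixed detection time."""

    def __init__(self, interval_s, multiplier):
        self.interval_s = interval_s
        self.multiplier = multiplier
        self.last_heard = {}  # neighbor -> time of most recent hello

    @property
    def detection_time(self):
        """Silence longer than this marks a neighbor down."""
        return self.interval_s * self.multiplier

    def hello(self, neighbor, now):
        """Record a hello received from a neighbor at time `now` (seconds)."""
        self.last_heard[neighbor] = now

    def down_neighbors(self, now):
        """Neighbors not heard from within the detection time, sorted."""
        return sorted(n for n, t in self.last_heard.items()
                      if now - t > self.detection_time)


if __name__ == "__main__":
    # Hypothetical timers: 50 ms hellos, down after 3 missed (about 150 ms).
    mon = LivenessMonitor(interval_s=0.05, multiplier=3)
    mon.hello("10.1.1.1", now=0.00)
    mon.hello("10.2.2.2", now=0.00)
    mon.hello("10.2.2.2", now=0.10)  # 10.1.1.1 has gone silent
    print(mon.down_neighbors(now=0.20))
```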

In summary, I feel systems should be designed to run independently in all
failure modes. If you lose 1-n neighbors, the system should be
self-sufficient in figuring out the situation near-immediately and continue
working while negotiating with neighbors about the overall picture.

Pete