Re: Router crash unplugs 1m Swedish Internet users

2003-06-24 Thread Petri Helenius

> 
> I've seen a case where a single error in the
> configuration file of a $VENDOR_1 router was accepted
> (due to an 'undocumented feature'), and this caused
> the wholesale importation of BGP routes into the IGP,
> which caused most of their $VENDOR_2 hardware to spaz
> out.  Locating the single error was a matter of hours,
> not minutes, so effectively a typo took out that ISP -
> and it's considered by most to be a relatively
> well-designed network.
> 
I have also seen a variation of this where the boxes that got flooded
by large IGP tables ran out of memory and did not recover (because there
was no memory left) after the broken router was withdrawn. Eventually
the network got fixed by restarting every box in succession.

Not sure if there are safeguards for this now or if everybody buys all their
IGP routers with 512M or more.

Pete
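
The safeguard Pete wonders about can be pictured as a hard cap on how many
prefixes a router will accept into its IGP table. What follows is only a
minimal sketch of that idea in Python, not any vendor's actual mechanism:
the class `IgpTable`, its `max_prefixes` parameter and the
`PrefixLimitExceeded` exception are hypothetical names invented for
illustration.

```python
from dataclasses import dataclass, field


class PrefixLimitExceeded(Exception):
    """Raised when a redistribution would push the IGP past its prefix cap."""


@dataclass
class IgpTable:
    # Hypothetical cap; a real value would depend on the memory in the boxes.
    max_prefixes: int
    prefixes: set = field(default_factory=set)

    def redistribute(self, candidates):
        """Import candidate prefixes, or reject the whole batch."""
        # Work out the resulting table size before touching the table, so a
        # rejected import leaves the existing IGP state exactly as it was.
        merged = self.prefixes | set(candidates)
        if len(merged) > self.max_prefixes:
            raise PrefixLimitExceeded(
                f"would hold {len(merged)} prefixes, limit is {self.max_prefixes}"
            )
        self.prefixes = merged
        return len(self.prefixes)
```

With a cap like this, a mistaken BGP-into-IGP redistribution is rejected as
a whole and the existing table is left untouched, instead of growing until
the box runs out of memory.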



RE: Router crash unplugs 1m Swedish Internet users

2003-06-23 Thread Ben Buxton

> On Mon, 23 Jun 2003, Jim Deleskie wrote:
> 
> > One router and it takes their entire network off-line... Maybe someone
> > needs an Intro to Networks 101 class.
> 
> Well, if the memory errors corrupt the forwarding table placed on the
> line cards or something similar, and the box still keeps its adjacencies
> up, then you can get these problems. I've seen it happen on route-cache
> boxes where certain entries in the ip-forwarding table were corrupted and
> thus incorrectly routed.

Oooh I had one of these once - a bug in the forwarding engine of a J-box
caused all through traffic to drop whilst maintaining adjacencies and
VRRP mastership. It took two incidents to determine what the cause of
the fault was.

So much for redundancy...

Fortunately it was at the tail end of a maintenance window. Unfortunately
it affected a similarly sized user base to the Telia fault (but luckily it
seems no-one noticed enough to publicise it :)

BB


Re: Router crash unplugs 1m Swedish Internet users

2003-06-23 Thread Mans Nilsson
Subject: Router crash unplugs 1m Swedish Internet users
Date: Mon, Jun 23, 2003 at 04:24:27PM -0400
Quoting Sean Donelan ([EMAIL PROTECTED]):
> 
> 
> Has anyone heard what the cause of the outage was?

Mikael wrote about memory shortage. I have heard the same -- though
not from press contacts but from staff. It was worded (but in
Swedish, so bear with my translation): "The official reason is
'memory shortage'. I do believe it is correct."

There have been words in the grapevine about not going for full memory on 
line cards and RP, for "optimisation reasons". Sounds like a fine recipe 
for promoting cascading failures from a fragile base config. 

-- 
Måns Nilsson Systems Specialist
+46 70 681 7204 KTHNOC
MN1334-RIPE

I represent a sardine!!




RE: Router crash unplugs 1m Swedish Internet users

2003-06-23 Thread Vadim Antonov


On Mon, 23 Jun 2003, Jim Deleskie wrote:

> One router and it takes their entire network off-line... Maybe someone needs
> an Intro to Networks 101 class.

No matter what kind of technology or design you have there are always
kinds of faults which may bring the entire system down.  The problem is
generally in recognizing when a fault has occurred, so that the operation
may be switched over to a backup.

Particularly, the present Internet routing architecture is (mis)designed
in such a way that it is incredibly easy for a local fault or human error
to bring a significant portion of the network down.  Even single-box
_hardware_ faults may lead to global crashes.

A long, long time ago I had to track down a problem which left the US and EU
pretty much disconnected for several hours. This turned out to be a
hardware problem in 7000's SSE card, which happily worked with packets
originating and terminating in the router itself, but silently dropped all
transit packets.  Voila!  Neighbour boxes were convinced that this one's
working - because all routing protocols were happy, and were trying to
send lots of traffic through it, which was simply going to a blackhole to
the mighty annoyance of everyone.  I've got a speeding ticket showing over
100mph on Dulles hwy at 3am, too, as a memento of rushing to DC with a
spare card...

So, in the absence of details, I would reserve judgement on the soundness of
design practices.

--vadim
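
The failure described above, where the routing protocols stay happy while
transit packets vanish, is invisible to any check that looks only at
adjacencies. A rough sketch of combining the two kinds of evidence follows.
It assumes some external mechanism sends probe packets *through* the router
and counts how many arrive; the function `classify_router`, the `Health`
enum and the 0.5 threshold are hypothetical and not taken from any product.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DOWN = "down"
    SUSPECTED_BLACKHOLE = "suspected blackhole"


def classify_router(adjacency_up, probes_sent, probes_delivered,
                    min_delivery_ratio=0.5):
    """Combine control-plane and data-plane evidence (hypothetical helper)."""
    if not adjacency_up:
        # Routing protocols already see this failure and route around it.
        return Health.DOWN
    if probes_sent == 0:
        # No data-plane evidence at all: the control plane is all we have.
        return Health.HEALTHY
    if probes_delivered / probes_sent < min_delivery_ratio:
        # Adjacencies are up but transit traffic is disappearing.
        return Health.SUSPECTED_BLACKHOLE
    return Health.HEALTHY
```

The point of the sketch is only that the third state exists: a box can be
"up" to its neighbours and still be a blackhole, and nothing but transit
probes will tell the two apart.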



RE: Router crash unplugs 1m Swedish Internet users

2003-06-23 Thread Stewart, William C (Bill), RTSLS

Jim wrote:
> One router and it takes their entire network off-line... 
> Maybe someone needs an Intro to Networks 101 class.

I assume things are designed in such a way that if the router were
actually dead, the traffic would take an alternate route.
But the posting commented that they'd been saying something about memory corruption.
There are unfortunately too many ways for a router to be 
"not dead yet", happily answering routing protocol messages
but not bothering to actually forward packets between network interfaces,
and if that happens on the router that's your best route due to
geography or BGP or whatever, it can take a while to catch.
Dealing with that is at least Networks 203 or maybe Networks 532 :-)

Additionally, while the article in the press referred to it as a "router",
that may be an accurate technical description from a reporter
who knows the technology, or it may be press shorthand
for "one of those high-tech thingies that ISPs use",
or it may be the ISP's Speaker-To-Reporters' watered-down description
of something.


  


RE: Router crash unplugs 1m Swedish Internet users

2003-06-23 Thread Jim Deleskie

I lived through one of these a few years ago; the core itself stayed up,
crippled though it was :)

David, your name sounds familiar. Have we worked @ the same place before?

-Jim

-Original Message-
From: David Barak [mailto:[EMAIL PROTECTED]
Sent: Monday, June 23, 2003 5:27 PM
To: Jim Deleskie; [EMAIL PROTECTED]
Subject: RE: Router crash unplugs 1m Swedish Internet users


I've seen a case where a single error in the
configuration file of a $VENDOR_1 router was accepted
(due to an 'undocumented feature'), and this caused
the wholesale importation of BGP routes into the IGP,
which caused most of their $VENDOR_2 hardware to spaz
out.  Locating the single error was a matter of hours,
not minutes, so effectively a typo took out that ISP -
and it's considered by most to be a relatively
well-designed network.

-David Barak

--- Jim Deleskie <[EMAIL PROTECTED]> wrote:
> 
> 
> One router and it takes their entire network
> off-line... Maybe someone needs
> an Intro to Networks 101 class.
> 
> -jim
>

=
David Barak
-fully RFC 1925 compliant-



RE: Router crash unplugs 1m Swedish Internet users

2003-06-23 Thread David Barak

I've seen a case where a single error in the
configuration file of a $VENDOR_1 router was accepted
(due to an 'undocumented feature'), and this caused
the wholesale importation of BGP routes into the IGP,
which caused most of their $VENDOR_2 hardware to spaz
out.  Locating the single error was a matter of hours,
not minutes, so effectively a typo took out that ISP -
and it's considered by most to be a relatively
well-designed network.

-David Barak

--- Jim Deleskie <[EMAIL PROTECTED]> wrote:
> 
> 
> One router and it takes their entire network
> off-line... Maybe someone needs
> an Intro to Networks 101 class.
> 
> -jim
>

=
David Barak
-fully RFC 1925 compliant-



RE: Router crash unplugs 1m Swedish Internet users

2003-06-23 Thread Mikael Abrahamsson

On Mon, 23 Jun 2003, Jim Deleskie wrote:

> One router and it takes their entire network off-line... Maybe someone needs
> an Intro to Networks 101 class.

Well, if the memory errors corrupt the forwarding table placed on the
line cards or something similar, and the box still keeps its adjacencies up, then
you can get these problems. I've seen it happen on route-cache boxes where
certain entries in the ip-forwarding table were corrupted and thus
incorrectly routed.

It could be that they ran out of memory on linecards as well, perhaps
injected too many routes etc, and lost dCEF (dunno if the problem was on
GSR or Juniper), been there, done that.

-- 
Mikael Abrahamsson    email: [EMAIL PROTECTED]



RE: Router crash unplugs 1m Swedish Internet users

2003-06-23 Thread Jim Deleskie


One router and it takes their entire network off-line... Maybe someone needs
an Intro to Networks 101 class.

-jim

-Original Message-
From: Sean Donelan [mailto:[EMAIL PROTECTED]
Sent: Monday, June 23, 2003 4:24 PM
To: [EMAIL PROTECTED]
Subject: Router crash unplugs 1m Swedish Internet users




Has anyone heard what the cause of the outage was?


Router crash unplugs 1m Swedish Internet users
Saturday, 21 June 2003

The breakdown of one of Sweden's main Internet routers in Stockholm
today unplugged more than 1 million of its Internet subscribers.

Reports say in total over 340,000 broadband and 700,000 dial-up customers
across the country were affected by the incident.

The router failure might also have caused disruptions to other Internet
subscribers, who use the services of providers operating on the Telia
network.


http://www.abc.net.au/science/news/scitech/SciTechRepublish_885166.htm


Re: Router crash unplugs 1m Swedish Internet users

2003-06-23 Thread Christopher L. Morrow

stupi.net was offline??

On Mon, 23 Jun 2003, Sean Donelan wrote:

>
>
> Has anyone heard what the cause of the outage was?
>
>
> Router crash unplugs 1m Swedish Internet users
> Saturday, 21 June 2003
>
> The breakdown of one of Sweden's main Internet routers in Stockholm
> today unplugged more than 1 million of its Internet subscribers.
>
> Reports say in total over 340,000 broadband and 700,000 dial-up customers
> across the country were affected by the incident.
>
> The router failure might also have caused disruptions to other Internet
> subscribers, who use the services of providers operating on the Telia
> network.
>
>
> http://www.abc.net.au/science/news/scitech/SciTechRepublish_885166.htm
>


Re: Router crash unplugs 1m Swedish Internet users

2003-06-23 Thread Mikael Abrahamsson

On Mon, 23 Jun 2003, Sean Donelan wrote:

> Has anyone heard what the cause of the outage was?

The official story was a memory fault of some kind, not specified as being
corruption, hardware error, fragmentation or something else. The outage was 3
hours and reports have been posted stating that it not only affected their
broadband business but also their company/commercial customers.

No further details have been released to the Swedish ISP community anyway;
it's likely that they're still investigating and might or might not
release further details.

I'm also curious.

-- 
Mikael Abrahamsson    email: [EMAIL PROTECTED]