Re: bgp convergence problem

2014-05-08 Thread Christopher Morrow
On Thu, May 8, 2014 at 1:51 AM, Mark Tinka mark.ti...@seacom.mu wrote:
 On Wednesday, May 07, 2014 07:28:46 PM Peter Rubenstein
 wrote:

 Operationally speaking, AS1 should not be leaking routes
 from one upstream to the other.  Bad route policy.

ideally it'd be nice to be valley-free... so to speak.

 Also, AS3 should not accept routes from AS1 that don't
 belong to it.  Customer router filtering would prevent
 this.

always with the route filtering... routes want to be free man, free!

 How I wish this happened in real life.

 We are chasing route leaks several AS's down the path that
 are not even remotely connected to us on a weekly basis. But
 I guess that's what they pay us for :-(.

if only there were some technology that could be used to thwart such things.


Re: bgp convergence problem

2014-05-08 Thread Mark Tinka
On Thursday, May 08, 2014 04:41:21 PM Christopher Morrow 
wrote:

 if only there were some technology that could be used to
 thwart such things.

It's gotten to a point where a repeat offender has me wound 
up enough to prepend his AS into some of my paths.

I wish there was a simpler way to turn them off.

Mark.




Re: bgp convergence problem

2014-05-08 Thread Christopher Morrow
On Thu, May 8, 2014 at 10:51 AM, Mark Tinka mark.ti...@seacom.mu wrote:
 On Thursday, May 08, 2014 04:41:21 PM Christopher Morrow
 wrote:

 if only there were some technology that could be used to
 thwart such things.

 It's gotten to a point where a repeat offender has me wound
 up enough to prepend his AS into some of my paths.

 I wish there was a simpler way to turn them off.

:( that's bad news... config hackery is brittle.
(but fun)


Re: bgp convergence problem

2014-05-08 Thread Mark Tinka
On Thursday, May 08, 2014 06:34:14 PM Christopher Morrow 
wrote:

 :( that's bad news... config hackery is brittle.
 (but fun)

Don't I know :-)... *sigh*

Mark.




RE: bgp convergence problem

2014-05-07 Thread Peter Rubenstein
Operationally speaking, AS1 should not be leaking routes from one upstream to 
the other.  Bad route policy.  Also, AS3 should not accept routes from AS1 that 
don't belong to it.  Customer router filtering would prevent this.
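The customer-facing filter described above can be sketched in Python rather than router configuration. This is only an illustration: the registered prefix `192.0.2.0/24` and the helper name `accept_from_customer` are hypothetical, and a real deployment would express the same rule as a prefix-list or route policy on AS3's edge router.

```python
import ipaddress

# Hypothetical: the prefixes AS1 has registered with its upstream as its own.
CUSTOMER_PREFIXES = [ipaddress.ip_network("192.0.2.0/24")]


def accept_from_customer(prefix, allowed=CUSTOMER_PREFIXES):
    """Accept an announcement only if it lies inside a registered customer prefix."""
    net = ipaddress.ip_network(prefix)
    # subnet_of() requires both networks to be the same IP version.
    return any(net.version == a.version and net.subnet_of(a) for a in allowed)


print(accept_from_customer("192.0.2.0/24"))   # AS1's own space
print(accept_from_customer("16.1.0.0/16"))    # AS5's prefix leaked by AS1
```

With such a filter on the AS3 side, the leaked 16.1/16 announcement never becomes a candidate route, so the local_pref question does not arise.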



Re: bgp convergence problem

2014-05-07 Thread Mark Tinka
On Wednesday, May 07, 2014 07:28:46 PM Peter Rubenstein 
wrote:

 Operationally speaking, AS1 should not be leaking routes
 from one upstream to the other.  Bad route policy. 
 Also, AS3 should not accept routes from AS1 that don't
 belong to it.  Customer router filtering would prevent
 this.

How I wish this happened in real life.

We are chasing route leaks several AS's down the path that 
are not even remotely connected to us on a weekly basis. But 
I guess that's what they pay us for :-(.

Mark.




Re: bgp convergence problem

2014-05-06 Thread ISP Services

Hi Song Li,

As far as I know there are 2 mechanisms that should prevent this 
situation you describe from happening:


- Not advertising routes that are not in the RIB
Once AS1's peering with AS3 comes back up, the route through AS3 is 
learned and preferred. Therefore the route via AS2 is purged from the 
RIB. Once it is no longer in the RIB, AS1 cannot announce that path anymore.


- AS Path loop prevention
If AS1 still leaks the prefix to AS3, it can only announce the active 
path which points to AS3 itself. Therefore AS3 will see a prefix with 
its own ASN in the path and (should) drop the prefix. Crisis avoided.
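A minimal Python sketch of the two mechanisms, using the AS numbers from the question. The function names are made up for illustration and do not correspond to any router implementation.

```python
def advertise(best_path, local_as):
    """A speaker advertises only its current best path, with its own ASN prepended."""
    return [local_as] + list(best_path)


def has_as_loop(as_path, local_as):
    """AS_PATH loop prevention: a route whose path already contains our ASN is dropped."""
    return local_as in as_path


# Once AS1 prefers the route via AS3, this is all it can leak back to AS3:
leak_after_switch = advertise([3, 5], local_as=1)
print(leak_after_switch, has_as_loop(leak_after_switch, local_as=3))

# While AS1 still uses the route via AS2, the leaked path carries no AS3:
leak_before_switch = advertise([2, 4, 5], local_as=1)
print(leak_before_switch, has_as_loop(leak_before_switch, local_as=3))
```

The second case is the window in which loop prevention does not help: until AS1 switches to the path via AS3, the leaked path contains no AS3 and AS3 has no protocol-level reason to reject it.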


My textbook knowledge is a bit rusty though..

Dennis Hagens


Re: bgp convergence problem

2014-05-06 Thread Song Li

Hi Dennis,

I think there are two possible convergence results:

1/ AS3 accepts route 16.1/16(2 4 5) from AS1, then withdraws its 
announcement of 16.1/16(5) towards AS1, and AS1 stays on 16.1/16(2 4 5).


2/ AS1 accepts route 16.1/16(3 5) from AS3, then withdraws 16.1/16(2 
4 5), and AS3 stays on 16.1/16(5).


I simulated this case in GNS3 and only got the first result. I 
don't know why.
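Both outcomes can be reproduced in a toy model in which AS1 and AS3 process each other's updates one at a time. The sketch below is an assumption-laden simplification (no timers, no MRAI, only these two speakers) and does not explain GNS3's behaviour. It only shows that the result depends on which side acts first.

```python
AS2_PATH = (2, 4, 5)   # what AS1 learns from AS2
AS5_PATH = (5,)        # what AS3 learns directly from AS5


def best_as1(from_as3):
    """AS1: drop looped paths, then prefer the shortest AS_PATH."""
    candidates = [AS2_PATH]
    if 1 not in from_as3:
        candidates.append(from_as3)
    return min(candidates, key=len)


def best_as3(from_as1):
    """AS3: drop looped paths, then prefer the customer route (higher local_pref)."""
    if 3 not in from_as1:
        return from_as1
    return AS5_PATH


def converge(first):
    """Let AS1 and AS3 take turns, starting with `first`, until nothing changes."""
    as1, as3 = AS2_PATH, AS5_PATH      # state right after the session comes back
    order = [first, 4 - first]         # 1 -> [1, 3], 3 -> [3, 1]
    changed = True
    while changed:
        changed = False
        for node in order:
            if node == 1:
                new = best_as1((3,) + as3)   # AS3 prepends itself when advertising
                changed |= new != as1
                as1 = new
            else:
                new = best_as3((1,) + as1)   # AS1 prepends itself when leaking
                changed |= new != as3
                as3 = new
    return as1, as3


print(converge(3))   # AS3 acts on the leak first
print(converge(1))   # AS1 acts on AS3's announcement first
```

If AS3 acts first, the model settles on result 1/. If AS1 acts first, it settles on result 2/. Each state is stable because the losing announcement is rejected by AS_PATH loop detection.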


Song


--
Song Li
Room 4-204, FIT Building,
Network Security,
Department of Electronic Engineering,
Tsinghua University, Beijing 100084, China
Tel:( +86) 010-62446440
E-mail: refresh.ls...@gmail.com



Re: bgp convergence problem

2014-05-06 Thread ISP Services

I suggest you work your way down :-)

http://www.cisco.com/c/en/us/support/docs/ip/border-gateway-protocol-bgp/13753-25.html

Dennis Hagens


Re: bgp convergence problem

2014-05-06 Thread Valdis . Kletnieks
On Tue, 06 May 2014 11:58:58 +0800, Song Li said:

 I have one bgp convergence problem which confused me. The problem is as
 follows:

You may want to Google for 'BGP Wedgie'.

https://www.nanog.org/meetings/nanog31/presentations/griffin.pdf
http://www.rfc-base.org/txt/rfc-4264.txt

Once you understand how and why they happen, your routing question
will become clear. :)




Re: bgp convergence problem

2014-05-06 Thread Randy Bush
 I have one bgp convergence problem which confused me. The problem is as 
 follows:
 

this is a bgp wedgie.  is it real and caught in the wild?  tim would be
cheered.

randy


bgp convergence problem

2014-05-05 Thread Song Li

Hi everyone,

I have one bgp convergence problem which confused me. The problem is as 
follows:


              +--------+
              |  AS5   |
    +---------+16.1/16 |
    |         +---+----+
    |             |
 +--+---+         |
 | AS4  |         |
 |      |         |
 +--+---+         |
    |             |
    |             |
    |             |
+---+----+    +---+---+
|  AS2   |    | AS3   | 16.1/16 (5)
|  ISP   |    | ISP   |
+---^----+    +---^---+
    |             |
    | +--------+  |
    +-+  AS1   +--+
      |customer|
      +--------+
      16.1/16 (2 4 5)

AS1 is multihomed to AS2 and AS3. For some reason AS1 disconnects from AS3, 
and as a result its route to 16.1/16 becomes 16.1/16 (2 4 5).


After a while, the BGP session between AS1 and AS3 is re-established, but 
AS1 leaks the route 16.1/16 (2 4 5) to AS3. At this point:


1/ AS1 has two BGP routes for prefix 16.1/16: 16.1/16 (2 4 5) and 
16.1/16 (3 5). Preferring the shorter AS_PATH, it selects 16.1/16 (3 5) 
as its best route.


2/ AS3 also has two BGP routes: 16.1/16 (2 4 5) and 16.1/16 (5). 
Preferring the higher local_pref, it selects 16.1/16 (2 4 5).


In this case, AS1 and AS3 each select the other as the best route to AS5. I 
wonder which route will be the final best route in AS1 and AS3 after BGP 
convergence.
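The two selections in 1/ and 2/ follow from the order of the BGP decision process, where local_pref is compared before AS_PATH length. A small Python sketch follows. The `Route` class and the local_pref values 100 and 200 are assumptions for illustration, and the paths use the notation of the message (on the wire, the route AS3 hears from AS1 would also carry AS1 in its path).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    as_path: tuple
    local_pref: int = 100   # hypothetical default value


def best(routes):
    """Highest local_pref wins; AS_PATH length only breaks ties."""
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))


# AS1 applies no local_pref policy, so the shorter AS_PATH wins.
as1_best = best([Route((2, 4, 5)), Route((3, 5))])

# AS3 gives the customer-learned route a higher (hypothetical) local_pref.
as3_best = best([Route((2, 4, 5), local_pref=200), Route((5,))])

print(as1_best.as_path, as3_best.as_path)
```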


Thanks!

--
Song Li
Room 4-204, FIT Building,
Network Security,
Department of Electronic Engineering,
Tsinghua University, Beijing 100084, China
Tel:( +86) 010-62446440
E-mail: refresh.ls...@gmail.com


BGP convergence problem

2010-06-08 Thread Andy B.
Hi,

This morning there was an Ethernet loop problem on DECIX, causing many
BGP sessions to flap throughout the entire platform.
While this can happen, I am now facing BGP convergence
problems myself on our DECIX router (SUP720-3BXL with IOS SXI3).

The DECIX loop was resolved two hours ago, but my BGP sessions are
still flapping and not converging at all. This has been flooding our
logs, and is still going on:

Jun  8 11:47:03 x.x.x.131 239447: Jun  8 11:48:38.364 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.194.32 Up
Jun  8 11:47:03 x.x.x.131 239448: Jun  8 11:48:38.364 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.231 Up
Jun  8 11:47:03 x.x.x.131 239449: Jun  8 11:48:38.364 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.109 Up
Jun  8 11:47:03 x.x.x.131 239450: Jun  8 11:48:38.364 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.194.50 Up
Jun  8 11:47:03 x.x.x.131 239451: Jun  8 11:48:38.364 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.81 Up
Jun  8 11:47:03 x.x.x.131 239452: Jun  8 11:48:38.364 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.28 Up
Jun  8 11:47:03 x.x.x.131 239453: Jun  8 11:48:38.364 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.193.212 Up
Jun  8 11:47:03 x.x.x.131 239454: Jun  8 11:48:38.368 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.193.147 Up
Jun  8 11:47:03 x.x.x.131 239455: Jun  8 11:48:38.368 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.74 Up
Jun  8 11:47:03 x.x.x.131 239456: Jun  8 11:48:38.368 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.241 Up
Jun  8 11:47:03 x.x.x.131 239457: Jun  8 11:48:38.368 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.194.5 Up
Jun  8 11:47:03 x.x.x.131 239458: Jun  8 11:48:38.368 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.40 Up
Jun  8 11:47:03 x.x.x.131 239459: Jun  8 11:48:38.368 CEST:
%BGP-5-ADJCHANGE: neighbor 2001:7F8::1A44:0:1 Up
Jun  8 11:47:03 x.x.x.131 239460: Jun  8 11:48:38.368 CEST:
%BGP-5-ADJCHANGE: neighbor 2001:7F8::8605:0:1 Up
Jun  8 11:47:03 x.x.x.131 239461: Jun  8 11:48:38.368 CEST:
%BGP-5-ADJCHANGE: neighbor 2001:7F8::1A0B:0:1 Up
Jun  8 11:47:03 x.x.x.131 239462: Jun  8 11:48:38.368 CEST:
%BGP-5-ADJCHANGE: neighbor 2001:7F8::3029:0:1 Up
Jun  8 11:47:03 x.x.x.131 239463: Jun  8 11:48:38.368 CEST:
%BGP-5-ADJCHANGE: neighbor 2001:7F8::6E4:0:1 Up
Jun  8 11:47:03 x.x.x.131 239464: Jun  8 11:48:38.372 CEST:
%BGP-5-ADJCHANGE: neighbor 2001:7F8::CB0:0:1 Up
Jun  8 11:47:03 x.x.x.131 239465: Jun  8 11:48:38.372 CEST:
%BGP-5-ADJCHANGE: neighbor 2001:7F8::21C8:0:1 Up
Jun  8 11:47:03 x.x.x.131 239466: Jun  8 11:48:38.372 CEST:
%BGP-5-ADJCHANGE: neighbor 2001:7F8::8463:0:2 Up
Jun  8 11:47:04 x.x.x.131 239467: Jun  8 11:48:38.372 CEST:
%BGP-5-ADJCHANGE: neighbor 2001:7F8::31AA:0:1 Up
Jun  8 11:47:04 x.x.x.131 239468: Jun  8 11:48:38.372 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.194.29 Up
Jun  8 11:47:04 x.x.x.131 239469: Jun  8 11:48:38.372 CEST:
%BGP-5-ADJCHANGE: neighbor 2001:7F8::62BF:0:1 Up
Jun  8 11:47:04 x.x.x.131 239470: Jun  8 11:48:39.656 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.101 Down BGP Notification sent
Jun  8 11:47:04 x.x.x.131 239471: Jun  8 11:48:39.656 CEST:
%BGP-3-NOTIFICATION: sent to neighbor 80.81.192.101 4/0 (hold time
expired) 0 bytes
Jun  8 11:47:07 x.x.x.131 239472: Jun  8 11:48:41.696 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.104 Up
Jun  8 11:47:10 x.x.x.131 239473: Jun  8 11:48:44.488 CEST:
%BGP-3-BGP_NO_REMOTE_READ: 80.81.193.187 connection timed out - has
not accepted a message from us for 2ms (hold time), 1 messages
pending transmition.
Jun  8 11:47:10 x.x.x.131 239474: Jun  8 11:48:44.488 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.193.187 Down BGP Notification sent
Jun  8 11:47:10 x.x.x.131 239475: Jun  8 11:48:44.488 CEST:
%BGP-3-NOTIFICATION: sent to neighbor 80.81.193.187 4/0 (hold time
expired) 0 bytes
Jun  8 11:47:10 x.x.x.131 239476: Jun  8 11:48:44.900 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.194.61 Up
Jun  8 11:47:10 x.x.x.131 239477: Jun  8 11:48:44.900 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.149 Up
Jun  8 11:47:10 x.x.x.131 239478: Jun  8 11:48:44.900 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.136 Up
Jun  8 11:47:10 x.x.x.131 239479: Jun  8 11:48:44.904 CEST:
%BGP-5-ADJCHANGE: neighbor 2001:7F8::8463:0:1 Up
Jun  8 11:47:10 x.x.x.131 239480: Jun  8 11:48:46.352 CEST:
%BGP-5-ADJCHANGE: neighbor 2001:7F8::6268:0:1 Up
Jun  8 11:47:14 x.x.x.131 239481: Jun  8 11:48:48.084 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.193.78 Up
Jun  8 11:47:14 x.x.x.131 239482: Jun  8 11:48:49.172 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.193.239 Up
Jun  8 11:47:14 x.x.x.131 239483: Jun  8 11:48:49.172 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.194.24 Up
Jun  8 11:47:17 x.x.x.131 239484: Jun  8 11:48:52.160 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.194.45 Up
Jun  8 11:47:17 x.x.x.131 239485: Jun  8 11:48:52.160 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.108 Up
Jun  8 11:47:17 x.x.x.131 239486: Jun  8 11:48:52.160 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.164 Up
Jun  8 11:47:17 x.x.x.131 239487: Jun  8 11:48:52.164 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.193.49 Up
Jun  8 11:47:17 x.x.x.131 
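To see which neighbors are actually flapping in a log like the one above, the ADJCHANGE messages can be tallied per neighbor. The following Python sketch assumes the message format shown in the excerpt. The function name `count_transitions` is invented for this example.

```python
import re
from collections import Counter

# Matches e.g. "%BGP-5-ADJCHANGE: neighbor 80.81.192.101 Down BGP Notification sent"
ADJCHANGE = re.compile(r"%BGP-5-ADJCHANGE:\s+neighbor\s+(\S+)\s+(Up|Down)")


def count_transitions(log_text):
    """Return {neighbor: Counter({'Up': n, 'Down': m})} for IOS syslog text."""
    # Archived log lines are wrapped, so collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    counts = {}
    for neighbor, state in ADJCHANGE.findall(flat):
        counts.setdefault(neighbor, Counter())[state] += 1
    return counts


sample = """Jun  8 11:47:04 x.x.x.131 239470: Jun  8 11:48:39.656 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.101 Down BGP Notification sent
Jun  8 11:47:07 x.x.x.131 239472: Jun  8 11:48:41.696 CEST:
%BGP-5-ADJCHANGE: neighbor 80.81.192.104 Up
"""
for neighbor, c in count_transitions(sample).items():
    print(neighbor, dict(c))
```

Neighbors with many Up and Down events over a short period are the ones caught in the flap cycle. Neighbors with a single Up have simply re-established.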

Re: BGP convergence problem

2010-06-08 Thread Ingo Flaschberger

Dear Andy


This morning there was an Ethernet loop problem on DECIX, causing many
BGP sessions to flap throughout the entire platform.
While this can happen, I am now facing BGP convergence
problems myself on our DECIX router (SUP720-3BXL with IOS SXI3).

The DECIX loop was resolved two hours ago, but my BGP sessions are
still flapping and not converging at all. This has been flooding our
logs, and is still going on:


Route half or more of the peering network to Null, which reduces the number 
of BGP sessions coming up at the same time.

(On the other hand, your BGP router seems to be overloaded.)

Kind regards,
Ingo Flaschberger




Re: BGP convergence problem

2010-06-08 Thread Andy B.
I finally decided to shut down all peerings and brought them back one by one.

Everything is stable again, but I don't like the way I had to deal
with it since it will most likely happen again when DECIX or another
IX we're at is having issues.

I've seen a few BGP convergence discussions on NANOG, but none about
deadlock situations and what could be done to avoid them. Setting
higher MTU or bigger hold queues did not help.

- Andy


Re: BGP convergence problem

2010-06-08 Thread Jared Mauch

On Jun 8, 2010, at 10:27 AM, Andy B. wrote:

 I finally decided to shut down all peerings and brought them back one by one.
 
 Everything is stable again, but I don't like the way I had to deal
 with it since it will most likely happen again when DECIX or another
 IX we're at is having issues.
 
 I've seen a few BGP convergence discussions on NANOG, but none about
 deadlock situations and what could be done to avoid them. Setting
 higher MTU or bigger hold queues did not help.

The Cisco 7600 and 6500 platforms are getting fairly old and have underpowered 
cpus these days.

Starting in SXH the control plane did not scale quite as well as in SXF.  This 
got better in SXI, but is not back on par with SXF performance yet.

I mostly attribute this to a combination of bloat in software and routing 
tables.  I would start to look for a replacement sooner rather than later.

- Jared


Re: BGP convergence problem

2010-06-08 Thread Matthew Petach
On Tue, Jun 8, 2010 at 7:27 AM, Andy B. globic...@gmail.com wrote:
 I finally decided to shut down all peerings and brought them back one by one.

 Everything is stable again, but I don't like the way I had to deal
 with it since it will most likely happen again when DECIX or another
 IX we're at is having issues.

 I've seen a few BGP convergence discussions on NANOG, but none about
 deadlock situations and what could be done to avoid them. Setting
 higher MTU or bigger hold queues did not help.

 - Andy

Some people have found that upgrading to an alternate router vendor
helps.  ^_^;

Fundamentally, the CPU on your router is underpowered for the amount
of state information that needs to be updated in the time window of the
hold timers.  If you can't move to a faster/more efficient platform, then
you may need to negotiate raising the keepalive interval and corresponding
hold timers with your neighbors, to give your router time to finish processing
updates.

Alternately, if you aren't in a position to be able to upgrade platforms, but
have spare routers around, connecting a second router up to the exchange
and splitting your neighbors up among two links into the exchange would
reduce the load on each router during reconvergence, and buy you time
until you can move to a more capable platform.
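On the timer point above: the reason this has to be negotiated with neighbors is that, per RFC 4271, a session runs on the smaller of the two hold times proposed in the OPEN messages, so raising only one side changes nothing. A short Python sketch follows. The values 180 and 600 are example numbers, not recommendations.

```python
def negotiated_timers(local_hold, peer_hold):
    """Return (hold_time, keepalive_interval) in seconds for one session.

    RFC 4271: the session uses the smaller of the two proposed hold times;
    one third of the hold time is the suggested maximum keepalive interval.
    """
    hold = min(local_hold, peer_hold)
    return hold, hold // 3


print(negotiated_timers(600, 180))   # only we raised the timer
print(negotiated_timers(600, 600))   # both sides agreed to raise it
```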

Matt



Re: BGP convergence problem

2010-06-08 Thread Richard A Steenbergen
On Tue, Jun 08, 2010 at 12:22:04PM -0400, Jared Mauch wrote:
 
 The Cisco 7600 and 6500 platforms are getting fairly old and have
 underpowered cpus these days.
 
 Starting in SXH the control plane did not scale quite as well as in
 SXF.  This got better in SXI, but is not back on par with SXF
 performance yet.
 
 I mostly attribute this to a combination of bloat in software and
 routing tables.  I would start to look for a replacement sooner rather
 than later.

Place blame where blame is due: the CPU may be slow, but the crappy IOS
scheduler is the real problem here. We saw a huge reduction in the
number of self-sustaining protocol timeout cycles on these boxes
(where the process of trying to bring up a new neighbor and converge
routing uses so much CPU that it causes other neighbors to time out,
resulting in a never-ending cycle of fail until you shut down everything
and bring them up one neighbor at a time) with the move from SXF to the 
SR branches. We never really went down the SXH/SXI road, but I'd have 
assumed they would have introduced the same improvements there too. I 
guess you know what they say about assuming. :)

Try the usual suspects:

* Configure process-max-time 20 at the top level, this improves 
interactivity by making the scheduler switch processes more often.

* Make sure you don't have an overly aggressive control-plane policer. 
In my experience the COPP rate-limits are quite harsh, and if you end up 
bumping against them you don't get a graceful slowing of the exchange of 
routes, you get protocol timeouts.

* Make sure you don't have any stupid mls rate-limits, such as cef 
receive. I don't know why anyone would ever want to configure this, all 
it does is make your box fall over faster (as if these things need any 
help) by rate-limiting all traffic to the msfc.

* You might want to try something like scheduler allocate 400 4000,
which gives the vast majority of the cpu time to the control plane
rather than process switching on the data plane (which in theory
shouldn't happen on an entirely hw forwarded box like 6500/7600, though 
of course we all know that isn't true :P).
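The self-sustaining timeout cycle described earlier in this message, and the reason bringing neighbors up one at a time breaks it, can be illustrated with a deliberately crude Python model. Everything in it is an assumption made for illustration: the per-neighbor CPU cost, the hold time, and the idea that the CPU services nothing else while a batch converges. It says nothing about how the IOS scheduler really behaves.

```python
def sessions_survive(neighbors, cpu_seconds_per_neighbor, hold_time, batch_size):
    """Toy model: while one batch of neighbors converges, the CPU is busy for
    batch_size * cpu_seconds_per_neighbor seconds and sends no keepalives.
    Already established sessions survive only if that is below the hold time.
    """
    busy = min(batch_size, neighbors) * cpu_seconds_per_neighbor
    return busy < hold_time


# Hypothetical numbers: 100 peers, 5 s of CPU each, 180 s hold time.
print(sessions_survive(100, 5, 180, batch_size=100))  # everything at once
print(sessions_survive(100, 5, 180, batch_size=1))    # one neighbor at a time
```

In this model, the scheduler tweaks listed above amount to letting keepalive processing interrupt the busy period, rather than shortening it.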

Oh and also the OP should take this to the cisco-nsp mailing list, where 
all the good bitching about broken Crisco routers takes place. :)

-- 
Richard A Steenbergen r...@e-gerbil.net   http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)



Re: BGP convergence problem

2010-06-08 Thread Randy Bush
 The Cisco 7600 and 6500 platforms are getting fairly old and have
 underpowered cpus these days.

the hamsters in them were never well fed, ever.  though i have never run
one, too yucchhy, i have measured receiving a research feed from one.
over ten minutes for a full table while a router takes two.

some researcher into archeology might try to measure if it is just a sick
tcp or if it is closer to rib-out.

randy



Re: BGP convergence problem

2010-06-08 Thread Niels Bakker

* globic...@gmail.com (Andy B.) [Tue 08 Jun 2010, 16:28 CEST]:
I finally decided to shut down all peerings and brought them back 
one by one.


Sadly that's often the way it has to be done, modulo mild tweaks.


Everything is stable again, but I don't like the way I had to deal 
with it since it will most likely happen again when DECIX or 
another IX we're at is having issues.


As others have said upthread in more polite wordings, get a better 
router if yours can't handle the load.  (Or use the route servers more 
- it's what they're there for.)



I've seen a few BGP convergence discussions on NANOG, but none about 
deadlock situations and what could be done to avoid them. Setting 
higher MTU or bigger hold queues did not help.


I hope you didn't change the MTU to anything different from what 
everybody else on the DE-CIX Peering LAN uses - that only leads to 
suffering.



-- Niels.

--
It's amazing what people will do to get their name on the internet, 
 which is odd, because all you really need is a Blogspot account.

-- roy edroso, alicublog.blogspot.com