Re: [j-nsp] route BGP stall bug

2012-07-25 Thread Saku Ytti
On (2012-07-25 09:08 +0200), Daniel Roesen wrote:

> As discussed "offline", not when OSPF implementations follow the OSPFv2
> spec from 1998, (the still current) RFC 2328. I'm not aware of widely used
> implementations behaving differently, and am too lazy to lab it. :)

Just tried this in 11.4R3: the 'overload' node still carries transit
traffic; other nodes just see it with '65536' as the metric.

One would hope that 'no-rfc-1583' would change this behaviour, but it does
not.

Unsure how current IOS behaves.
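For reference, the Junos knobs being tested map to roughly this
configuration (a sketch from memory; exact hierarchy may vary by release):

    protocols {
        ospf {
            overload;      # advertise own transit links at max metric
            no-rfc-1583;   # post-RFC1583 external route preference rules
        }
    }

Under RFC 2328 a transit link with metric 0xffff is still usable, just
maximally expensive, which would explain why the 'overload' node keeps
transiting.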
-- 
  ++ytti
___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] route BGP stall bug

2012-07-25 Thread Daniel Roesen
On Mon, Jul 23, 2012 at 07:07:51PM +0300, Saku Ytti wrote:
> > To cut Cisco some slack, OSPF has no simple overload bit, so you have to
> > raise your link metrics all the way for a somewhat crude and limited
> > emulation of IS-IS' overload bit. There are still possible setups where
> 
> 'max-metric router-lsa' is a very accurate replica.

As discussed "offline", not when OSPF implementations follow the OSPFv2
spec from 1998, (the still current) RFC 2328. I'm not aware of widely used
implementations behaving differently, and am too lazy to lab it. :)

Best regards,
Daniel

-- 
CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0


Re: [j-nsp] route BGP stall bug

2012-07-23 Thread Saku Ytti
On (2012-07-23 17:16 +0200), Daniel Roesen wrote:

> To cut Cisco some slack, OSPF has no simple overload bit, so you have to
> raise your link metrics all the way for a somewhat crude and limited
> emulation of IS-IS' overload bit. There are still possible setups where

'max-metric router-lsa' is a very accurate replica.
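For comparison, a sketch of the two knobs in IOS (exact syntax depends on
release and IGP process configuration):

    router isis
     set-overload-bit
    !
    router ospf 1
     max-metric router-lsa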

-- 
  ++ytti


Re: [j-nsp] route BGP stall bug

2012-07-23 Thread Daniel Roesen
On Mon, Jul 23, 2012 at 03:14:08PM +0300, Saku Ytti wrote:
> On (2012-07-23 13:55 +0200), Daniel Roesen wrote:
> 
> > Actually, I see it came in 12.0(10)S, so it was there, but we were
> > running OSPF, not IS-IS, so no help.
> 
> Aye. Some dirt towards Cisco too: it is ridiculous that you don't have
> feature parity between IS-IS and OSPF in things like these; they should
> use a common code path, a single file in the source repo, in any
> half-decent design.

To cut Cisco some slack, OSPF has no simple overload bit, so you have to
raise your link metrics all the way for a somewhat crude and limited
emulation of IS-IS' overload bit. There are still possible setups where
an OSPF-based wannabe-overload replica doesn't work the way IS-IS overload
would and can still lead to blackholing, although those would be quite
specific/special scenarios.

Best regards,
Daniel

-- 
CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0


Re: [j-nsp] route BGP stall bug

2012-07-23 Thread Saku Ytti
On (2012-07-23 13:55 +0200), Daniel Roesen wrote:

> Actually, I see it came in 12.0(10)S, so it was there, but we were
> running OSPF, not IS-IS, so no help.

Aye. Some dirt towards Cisco too: it is ridiculous that you don't have
feature parity between IS-IS and OSPF in things like these; they should
use a common code path, a single file in the source repo, in any
half-decent design.

While academically OSPF and IS-IS are very much the same, in practice,
since they are developed in a very insulated manner, it makes sense to run
what everyone else is running, in the hope that the big dogs lab them and
request critical features. This is, in my opinion, the main reason for an
SP to run IS-IS.
Common wisdom is to run whichever your people are familiar with, but
people change, and learning the basics of operating IS-IS or OSPF is a
matter of hours anyhow.

-- 
  ++ytti


Re: [j-nsp] route BGP stall bug

2012-07-23 Thread Daniel Roesen
On Mon, Jul 23, 2012 at 01:46:16PM +0200, Daniel Roesen wrote:
> > 'external overload signalling'
> 
> Did that exist in 2001 deployed 12.0S code?
> If so, we obviously didn't know about it.

Actually, I see it came in 12.0(10)S, so it was there, but we were
running OSPF, not IS-IS, so no help.

Best regards,
Daniel

-- 
CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0


Re: [j-nsp] route BGP stall bug

2012-07-23 Thread Saku Ytti
On (2012-07-23 13:46 +0200), Daniel Roesen wrote:

> > 'external overload signalling'
> 
> Did that exist in 2001 deployed 12.0S code?
> If so, we obviously didn't know about it.

My memory is hazy, but I would date it earlier, maybe to 1999.

-- 
  ++ytti


Re: [j-nsp] route BGP stall bug

2012-07-23 Thread Daniel Roesen
On Mon, Jul 23, 2012 at 11:09:02AM +0300, Saku Ytti wrote:
> On (2012-07-23 10:02 +0200), Daniel Roesen wrote:
> 
> > Which wasn't really correct. What happened was that the linecard ran out
> > of memory, and disabled CEF. Result: IGP adjacency remained up, but
> > traffic forwarding stopped. Instant blackholing.
> 
> 'external overload signalling'

Did that exist in 2001 deployed 12.0S code?
If so, we obviously didn't know about it.

Best regards,
Daniel

-- 
CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0


Re: [j-nsp] route BGP stall bug

2012-07-23 Thread Saku Ytti
On (2012-07-23 10:02 +0200), Daniel Roesen wrote:

> Which wasn't really correct. What happened was that the linecard ran out
> of memory, and disabled CEF. Result: IGP adjacency remained up, but
> traffic forwarding stopped. Instant blackholing.

'external overload signalling'

-- 
  ++ytti


Re: [j-nsp] route BGP stall bug

2012-07-23 Thread Daniel Roesen
On Wed, Jul 18, 2012 at 10:22:26AM +0300, Saku Ytti wrote:
> I remember back in the JunOS4 days when Juniper was trying to enter the
> market, one of the first sales pitches I heard 'you know how GSR drops
> neighbours and stops forwarding when it runs out of memory?'

Which wasn't really correct. What happened was that the linecard ran out
of memory, and disabled CEF. Result: IGP adjacency remained up, but
traffic forwarding stopped. Instant blackholing.


Best regards,
Dan'first-hand report'iel

-- 
CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0


Re: [j-nsp] route BGP stall bug

2012-07-23 Thread Ido Szargel
Hi Tim,

Do you happen to have port mirroring/sampling enabled on the router?
We encountered a similar issue; JTAC found that the sampled process was
causing this behaviour, and it is solved in 11.4R4 (we have not upgraded
yet to test in our environment, and it also doesn't appear in the release
notes, but the JTAC engineer said it is solved).
The relevant PR is PR726841; while it is filed with the details of our
specific test case, the issue is (according to JTAC): "Sampled being the
slow daemon lead to the slow operation of route updation from KRT --- PFE
and KRT was stuck for some time."


Regards,
Ido


-Original Message-
From: juniper-nsp-boun...@puck.nether.net
[mailto:juniper-nsp-boun...@puck.nether.net] On Behalf Of Tim Vollebregt
Sent: Wednesday, July 18, 2012 1:04 AM
To: Juniper-NSP
Subject: [j-nsp] route BGP stall bug

Hi All,

This morning during a maintenance I experienced the route stall bug Richard
mentioned a few times already on j-nsp.

[...]

Re: [j-nsp] route BGP stall bug

2012-07-18 Thread Richard A Steenbergen
On Wed, Jul 18, 2012 at 12:03:39AM +0200, Tim Vollebregt wrote:
> Hi All,
> 
> This morning during a maintenance I experienced the route stall bug 
> Richard mentioned a few times already on j-nsp.
> 
> Hardware kit:
> -MX480 with SCB (non-e)
> -2 x RE-S-1800x4
> -4 x MPC 3D 16x 10GE
> Software version: 10.4R8.5
> During this maintenance I was placing 2 new routing engines into the 
> router, replacing the 'old' RE-S-2000. This router is pushing a lot of 
> traffic and receiving 14 x full BGP tables from eBGP peers/1 RR 
> session to it's 'mate'/several iBGP peers with partial tables

Rest assured this issue is still alive and well in every piece of code 
I've ever looked at. I've basically just given up and accepted that 
Juniper can't actually handle a large number of routes, and nobody seems 
capable of fixing it. EX's are especially bad, I can't get a full fib 
installed from a reboot in anything less than an hour, even if I turn 
off most of the BGP sessions so it converges faster. Either stop 
carrying so many routes (14x full tables = you're screwed), or go buy a 
Cisco. :(

-- 
Richard A Steenbergen    http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)


Re: [j-nsp] route BGP stall bug

2012-07-18 Thread Tima Maryin

Hi,


Are there any suspicious messages logged around that moment?

There are some PRs related to the KRT queue getting stuck, so you probably
want to upgrade to 10.4R10 or investigate this issue with JTAC.


https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR722890


On 18.07.2012 2:03, Tim Vollebregt wrote:

Hi All,

This morning during a maintenance I experienced the route stall bug
Richard mentioned a few times already on j-nsp.

[...]




Re: [j-nsp] route BGP stall bug

2012-07-18 Thread Saku Ytti
On (2012-07-18 00:03 +0200), Tim Vollebregt wrote:

> IMHO this is a really bad one, and can be a showstopper in some cases.

Blackholing is absolutely the worst thing a router can do. And Juniper
often seems to have failure modes where it blackholes. In Cisco the much
more common failure mode is total failure (i.e. a crash), which I'm fine
with (that is how low my expectations of network hardware vendors are), as
I can route around it.

I remember back in the JunOS 4 days, when Juniper was trying to enter the
market, one of the first sales pitches I heard was 'you know how GSR drops
neighbours and stops forwarding when it runs out of memory?' ... 'well, we
just keep on forwarding based on the information we currently have!'. I
ran out of hands to properly express the level of facepalm I wanted.


-- 
  ++ytti


Re: [j-nsp] route BGP stall bug

2012-07-17 Thread Jared Mauch
Try the hidden 'show krt queue' command when this happens. It should give
you an idea of what is going on.

Jared Mauch

On Jul 17, 2012, at 6:03 PM, Tim Vollebregt  wrote:

> Hi All,
> 
> This morning during a maintenance I experienced the route stall bug Richard 
> mentioned a few times already on j-nsp.
> 
> [...]


[j-nsp] route BGP stall bug

2012-07-17 Thread Tim Vollebregt
Hi All,

This morning during a maintenance I experienced the route stall bug Richard 
mentioned a few times already on j-nsp.

Hardware kit:
-MX480 with SCB (non-e)
-2 x RE-S-1800x4
-4 x MPC 3D 16x 10GE
Software version: 10.4R8.5
During this maintenance I was placing 2 new routing engines into the
router, replacing the 'old' RE-S-2000. This router is pushing a lot of
traffic and receiving 14 full BGP tables from eBGP peers, one RR session
to its 'mate', and several iBGP sessions with partial tables.

After replacing the REs, the FPCs initialized and BGP sessions were
established, but it took quite some time before the RIB was completely
filled. After checking some hosts I came to the conclusion that some
destinations were unreachable, even though the RIB looked fine.

When checking the FIB with 'show route forwarding-table summary', I saw
that only 11K prefixes had been pushed to the FIB, and it was hanging.
As I was aware of the bug, I waited for some time; it eventually took
about 30 minutes to fill the FIB with 414K prefixes. During these 30
minutes a lot of destinations were unreachable and traffic was being
blackholed, since RIB exchange with peers was working fine.

As there was still some time left in the maintenance window, and I really
wanted a workaround for dealing with this bug, I did the following: I
deactivated all eBGP peer groups and did a switchover to the other routing
engine. When the FPCs were initialized, the router started building its
iBGP sessions towards the core routers and its RR session (full table).

This worked out quite well; the FIB was filled with the full table within
5 minutes. Afterwards I activated all eBGP peer groups again and monitored
the FIB. It eventually took about 30 minutes to fill the FIB with the
correct next-hops, but this time the blackholing lasted only a limited
amount of time.
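For the record, the workaround sequence was roughly the following
('ebgp-peers' is a placeholder group name; a sketch, assuming dual REs
with graceful switchover configured):

    configure
    deactivate protocols bgp group ebgp-peers
    commit and-quit
    request chassis routing-engine master switch
    # wait for 'show route forwarding-table summary' to show the full table
    configure
    activate protocols bgp group ebgp-peers
    commit and-quit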

It seems this bug has been there since release 10.0 (MPC), and there
doesn't seem to be a fix yet. Does anyone have more information about it,
PR number etc.?

IMHO this is a really bad one, and can be a showstopper in some cases.

Thanks for your time.

BR, Tim





