Re: [j-nsp] route BGP stall bug
On (2012-07-25 09:08 +0200), Daniel Roesen wrote:

> As discussed "offline", not when OSPF implementations follow OSPFv2 spec
> from 1998, (the still current) RFC2328. I'm not aware that widely used
> implementations behave different and am too lazy to lab that. :)

Just tried in 11.4R3 and the 'overload' node still transits; other nodes just see it with '65535' as metric. One would hope that 'no-rfc-1583' would change this behaviour, but it does not. Unsure how current IOS behaves.

-- 
++ytti
___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
Re: [j-nsp] route BGP stall bug
On Mon, Jul 23, 2012 at 07:07:51PM +0300, Saku Ytti wrote:

> > To cut Cisco some slack, OSPF has no simple overload bit, so you have to
> > raise your link metrics all the way for a somewhat crude and limited
> > emulation of IS-IS' overload bit. There are still possible setups where
>
> 'max-metric router-lsa' is very accurate replica.

As discussed "offline", not when OSPF implementations follow OSPFv2 spec from 1998, (the still current) RFC2328. I'm not aware that widely used implementations behave differently and am too lazy to lab that. :)

Best regards,
Daniel

-- 
CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0
Re: [j-nsp] route BGP stall bug
On (2012-07-23 17:16 +0200), Daniel Roesen wrote:

> To cut Cisco some slack, OSPF has no simple overload bit, so you have to
> raise your link metrics all the way for a somewhat crude and limited
> emulation of IS-IS' overload bit. There are still possible setups where

'max-metric router-lsa' is very accurate replica.

-- 
++ytti
Re: [j-nsp] route BGP stall bug
On Mon, Jul 23, 2012 at 03:14:08PM +0300, Saku Ytti wrote:

> On (2012-07-23 13:55 +0200), Daniel Roesen wrote:
>
> > Actually, I see it came in 12.0(10)S, so it was there, but we were
> > running OSPF, not IS-IS, so no help.
>
> Aye. Some dirt towards cisco too, it is ridiculous you don't have feature
> parity in ISIS and OSPF in things like these, these should use common
> codepath, single file in source code repo for any half decent design.

To cut Cisco some slack, OSPF has no simple overload bit, so you have to raise your link metrics all the way for a somewhat crude and limited emulation of IS-IS' overload bit. There are still possible setups where an OSPF-based wannabe-overload replica doesn't work like IS-IS overload would and still leads to blackholing, although those would be quite specific/special scenarios.

Best regards,
Daniel

-- 
CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0
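The difference between a raised metric and a real overload bit can be illustrated with a small shortest-path model. This is a minimal sketch in Python, not any vendor's SPF code: the four-router topology (A, B, C, D), the default cost of 10, and the function names spf, path and build are all assumptions made for illustration. Router B emulates overload by advertising 65535 (0xFFFF, the largest 16-bit link metric) on its own links; the no_transit argument models IS-IS overload semantics, where the node stays reachable but is not used for transit.

```python
import heapq

MAX_METRIC = 0xFFFF  # 65535, largest 16-bit link metric


def spf(links, source, no_transit=frozenset()):
    """Dijkstra over directed links {(u, v): cost}.

    Nodes in no_transit can be reached as destinations but are never
    expanded, so no path continues through them (overload-bit model).
    """
    adj = {}
    for (u, v), cost in links.items():
        adj.setdefault(u, []).append((v, cost))
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u in no_transit and u != source:
            continue  # reachable, but not usable for transit
        for v, cost in adj.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev


def path(prev, source, target):
    """Rebuild the hop list, or return None if target is unreachable."""
    if target != source and target not in prev:
        return None
    hops = [target]
    while hops[-1] != source:
        hops.append(prev[hops[-1]])
    return hops[::-1]


def build(overloaded="B", with_backup=False):
    """Hypothetical topology A-B-C, optionally with a backup path A-D-C.

    Each router advertises cost 10 on its own links, except the
    overloaded router, which advertises MAX_METRIC.
    """
    links = {}

    def add(u, v, cost=10):
        links[(u, v)] = MAX_METRIC if u == overloaded else cost
        links[(v, u)] = MAX_METRIC if v == overloaded else cost

    add("A", "B")
    add("B", "C")
    if with_backup:
        add("A", "D")
        add("D", "C")
    return links
```

With with_backup=True, the metric-based emulation moves A-to-C traffic onto D, which matches what an overload bit would achieve. With B as the only path, the metric-based SPF still routes A to C through B (at cost 10 + 65535), whereas spf(build(), "A", no_transit={"B"}) leaves C unreachable instead of sending traffic through B.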
Re: [j-nsp] route BGP stall bug
On (2012-07-23 13:55 +0200), Daniel Roesen wrote:

> Actually, I see it came in 12.0(10)S, so it was there, but we were
> running OSPF, not IS-IS, so no help.

Aye. Some dirt towards cisco too, it is ridiculous you don't have feature parity in ISIS and OSPF in things like these, these should use common codepath, single file in source code repo for any half decent design.

While academically OSPF and ISIS are very much the same, in practice they are developed in a very insulated manner, so it makes sense to run what everyone else is running, in the hope that the big dogs lab it and request critical features. This is, in my opinion, the main reason for an SP to run ISIS. Common wisdom is to run whichever your people are familiar with, but people change, and learning the basics of operating ISIS or OSPF is a matter of hours anyhow.

-- 
++ytti
Re: [j-nsp] route BGP stall bug
On Mon, Jul 23, 2012 at 01:46:16PM +0200, Daniel Roesen wrote:

> > 'external overload signalling'
>
> Did that exist in 2001 deployed 12.0S code?
> If so, we obviously didn't know about it.

Actually, I see it came in 12.0(10)S, so it was there, but we were running OSPF, not IS-IS, so no help.

Best regards,
Daniel

-- 
CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0
Re: [j-nsp] route BGP stall bug
On (2012-07-23 13:46 +0200), Daniel Roesen wrote:

> > 'external overload signalling'
>
> Did that exist in 2001 deployed 12.0S code?
> If so, we obviously didn't know about it.

My memory is hazy, but I would date it earlier, maybe to 1999.

-- 
++ytti
Re: [j-nsp] route BGP stall bug
On Mon, Jul 23, 2012 at 11:09:02AM +0300, Saku Ytti wrote:

> On (2012-07-23 10:02 +0200), Daniel Roesen wrote:
>
> > Which wasn't really correct. What happened was that the linecard ran out
> > of memory, and disabled CEF. Result: IGP adjacency remained up, but
> > traffic forwarding stopped. Instant blackholing.
>
> 'external overload signalling'

Did that exist in 2001 deployed 12.0S code? If so, we obviously didn't know about it.

Best regards,
Daniel

-- 
CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0
Re: [j-nsp] route BGP stall bug
On (2012-07-23 10:02 +0200), Daniel Roesen wrote:

> Which wasn't really correct. What happened was that the linecard ran out
> of memory, and disabled CEF. Result: IGP adjacency remained up, but
> traffic forwarding stopped. Instant blackholing.

'external overload signalling'

-- 
++ytti
Re: [j-nsp] route BGP stall bug
On Wed, Jul 18, 2012 at 10:22:26AM +0300, Saku Ytti wrote:

> I remember back in the JunOS4 days when Juniper was trying to enter the
> market, one of the first sales pitches I heard 'you know how GSR drops
> neighbours and stops forwarding when it runs out of memory?'

Which wasn't really correct. What happened was that the linecard ran out of memory, and disabled CEF. Result: IGP adjacency remained up, but traffic forwarding stopped. Instant blackholing.

Best regards,
Dan'first-hand report'iel

-- 
CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0
Re: [j-nsp] route BGP stall bug
Hi Tim,

Do you happen to have port mirroring/sampling enabled on the router?

We encountered a similar issue. JTAC found that the sampled process was causing this behavior, and it is solved in 11.4R4 (we have not upgraded yet to test it in our environment, and it doesn't appear in the release notes, but the JTAC engineer said it is solved).

The relevant PR is PR726841. While it is written up with the details of our specific test case, the issue is (according to JTAC): "Sampled being the slow daemon lead to the slow operation of route updation from KRT --- PFE and KRT was stuck for some time."

Regards,
Ido
Re: [j-nsp] route BGP stall bug
On Wed, Jul 18, 2012 at 12:03:39AM +0200, Tim Vollebregt wrote:

> Hi All,
>
> This morning during a maintenance I experienced the route stall bug
> Richard mentioned a few times already on j-nsp.
>
> Hardware kit:
> -MX480 with SCB (non-e)
> -2 x RE-S-1800x4
> -4 x MPC 3D 16x 10GE
> Software version: 10.4R8.5
> During this maintenance I was placing 2 new routing engines into the
> router, replacing the 'old' RE-S-2000. This router is pushing a lot of
> traffic and receiving 14 x full BGP tables from eBGP peers/1 RR
> session to its 'mate'/several iBGP peers with partial tables

Rest assured this issue is still alive and well in every piece of code I've ever looked at. I've basically just given up and accepted that Juniper can't actually handle a large number of routes, and nobody seems capable of fixing it. EXs are especially bad: I can't get a full FIB installed from a reboot in anything less than an hour, even if I turn off most of the BGP sessions so it converges faster.

Either stop carrying so many routes (14x full tables = you're screwed), or go buy a Cisco. :(

-- 
Richard A Steenbergen       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
Re: [j-nsp] route BGP stall bug
Hi,

Were there any suspicious messages logged at that moment? There are some PRs related to a stuck KRT queue, so you probably want to upgrade to 10.4R10 or investigate this issue with JTAC.

https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR722890
Re: [j-nsp] route BGP stall bug
On (2012-07-18 00:03 +0200), Tim Vollebregt wrote:

> IMHO this is a really bad one, and can be a showstopper in some cases.

Blackholing is absolutely the worst thing a router can do. And Juniper often seems to have failure modes where they blackhole. In Cisco a much more common failure mode is total failure (i.e. crash), which I'm fine with (that low are my expectations for network hardware vendors) as I can route around it.

I remember back in the JunOS4 days when Juniper was trying to enter the market, one of the first sales pitches I heard 'you know how GSR drops neighbours and stops forwarding when it runs out of memory?' ... 'well, we just keep on forwarding based on the information we currently have!'. I ran out of hands to properly express the level of facepalm I wanted.

-- 
++ytti
Re: [j-nsp] route BGP stall bug
Try the hidden 'show krt queue' command when this happens. It should give you an idea of what is going on.

Jared Mauch
[j-nsp] route BGP stall bug
Hi All,

This morning during a maintenance I experienced the route stall bug Richard mentioned a few times already on j-nsp.

Hardware kit:
-MX480 with SCB (non-e)
-2 x RE-S-1800x4
-4 x MPC 3D 16x 10GE
Software version: 10.4R8.5

During this maintenance I was placing 2 new routing engines into the router, replacing the 'old' RE-S-2000. This router is pushing a lot of traffic and receiving 14 x full BGP tables from eBGP peers, 1 RR session to its 'mate', and several iBGP peers with partial tables.

After replacing the REs, the FPCs initialized and BGP sessions were being established, but it took quite some time before the RIB was completely filled. After checking some hosts I came to the conclusion that there were unreachable destinations, even though the RIB was looking fine.

When checking the FIB by issuing the command 'show route forwarding-table summary' I saw that only 11K prefixes had been pushed to the FIB and it was hanging. As I was aware of the bug I waited for some time, and it eventually took about 30 minutes to fill the FIB with 414K prefixes. During these 30 minutes a lot of destinations were unreachable and traffic was being blackholed, because the RIB was still being exchanged with peers normally.

As there was still some time left in the maintenance window and I really wanted to have some workaround for dealing with this bug, I did the following. I deactivated all eBGP peer groups and did a switchover to the other routing engine. When the FPCs were initialized the router started building its iBGP sessions towards the core routers, and its RR session (full table).

This worked out quite well: the FIB was filled with the full table within 5 minutes. Afterwards I activated all eBGP peer groups again and monitored the FIB; it eventually took about 30 minutes to fill the FIB with the correct next-hops. But this time the blackholing lasted only a limited amount of time.

It seems this bug has been there since release 10.0 (MPC), and there doesn't seem to be a fix yet. Does anyone have more information about it, PR number etc?

IMHO this is a really bad one, and can be a showstopper in some cases.

Thanks for your time.

BR,
Tim
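The fill times reported above translate into rough FIB install rates, which can help when deciding whether a FIB is merely slow or actually stuck. The following is a minimal sketch in Python, assuming FIB prefix counts are sampled by some external means (how they are collected is not shown, and no Junos output format is assumed). The function names, the sample format and any threshold passed as min_rate are hypothetical.

```python
def average_rate(prefixes, minutes):
    """Average install rate in prefixes per second."""
    return prefixes / (minutes * 60)


def is_stalled(samples, target, min_rate):
    """Return True if the FIB is below target and barely growing.

    samples: list of (seconds_since_start, fib_prefix_count), oldest first.
    target: expected number of prefixes once the FIB is complete.
    min_rate: assumed lowest acceptable install rate, prefixes per second.
    Only the last two samples are compared.
    """
    if len(samples) < 2:
        return False  # not enough data to judge
    (t0, n0), (t1, n1) = samples[-2], samples[-1]
    if n1 >= target or t1 <= t0:
        return False  # already complete, or unusable timestamps
    return (n1 - n0) / (t1 - t0) < min_rate
```

With the numbers from this message, average_rate(414_000, 30) gives 230 prefixes per second for the slow fill and average_rate(414_000, 5) gives 1380 for the iBGP-only fill, treating each fill as starting from an empty FIB and assuming the full table was about 414K prefixes in both cases. A FIB sitting at 11K prefixes across two samples a minute apart, as in is_stalled([(0, 11_000), (60, 11_000)], 414_000, 50), would be flagged as stalled.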