Re: [j-nsp] MX960 Redundant RE problem
I was referring more to a bug in hardware... bad memory, etc.

Stefan Fouant
JNCIE-SEC, JNCIE-SP, JNCIE-ER, JNCI
Technical Trainer, Juniper Networks
Follow us on Twitter @JuniperEducate

Sent from my iPad

On Feb 15, 2012, at 1:56 PM, Daniel Roesen wrote:

> On Wed, Feb 15, 2012 at 12:24:50PM -0500, Stefan Fouant wrote:
>> The cool thing is the Backup RE is actually listening to all the
>> control plane messages coming in on fxp1 destined for the Master RE
>> and formulating its own decisions, running its own Dijkstra, BGP
>> path selection, etc. This is the preferred approach, as opposed to
>> simply mirroring routing state from the Primary to the Backup,
>> because it eliminates fate sharing: if there is a bug on the
>> Primary RE, we don't want to create a carbon copy of it on the Backup.
>
> I don't really buy that argument. Running the same code with the same
> algorithm against the same data usually leads to the same results.
> You'll get full bug redundancy - I'd expect the REs to crash
> simultaneously. Did NSR protect from any of the recent BGP bugs?
>
> The advantages I see are less impactful failovers in case of a)
> hardware failure of the active RE, or b) data structure corruption
> happening on both REs [same code => same bugs] but eventually leading
> to a crash of the active RE sooner than of the backup RE, or c) race
> conditions being triggered sufficiently differently timing-wise that
> only the active RE crashes.
>
> Am I missing something?
>
> Best regards,
> Daniel
>
> --
> CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0

___
juniper-nsp mailing list
juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
Re: [j-nsp] MX960 Redundant RE problem
On 2/15/12 10:56, Daniel Roesen wrote:
> On Wed, Feb 15, 2012 at 12:24:50PM -0500, Stefan Fouant wrote:
>> The cool thing is the Backup RE is actually listening to all the
>> control plane messages coming in on fxp1 destined for the Master RE
>> and formulating its own decisions, running its own Dijkstra, BGP
>> path selection, etc. This is the preferred approach, as opposed to
>> simply mirroring routing state from the Primary to the Backup,
>> because it eliminates fate sharing: if there is a bug on the
>> Primary RE, we don't want to create a carbon copy of it on the Backup.
>
> I don't really buy that argument. Running the same code with the same
> algorithm against the same data usually leads to the same results.
> You'll get full bug redundancy - I'd expect the REs to crash
> simultaneously. Did NSR protect from any of the recent BGP bugs?
>
> The advantages I see are less impactful failovers in case of a)
> hardware failure of the active RE, or b) data structure corruption
> happening on both REs [same code => same bugs] but eventually leading
> to a crash of the active RE sooner than of the backup RE, or c) race
> conditions being triggered sufficiently differently timing-wise that
> only the active RE crashes.

When ISSU actually works it's a godsend.

> Am I missing something?
>
> Best regards,
> Daniel
Re: [j-nsp] MX960 Redundant RE problem
On Wed, Feb 15, 2012 at 12:24:50PM -0500, Stefan Fouant wrote:
> The cool thing is the Backup RE is actually listening to all the
> control plane messages coming in on fxp1 destined for the Master RE
> and formulating its own decisions, running its own Dijkstra, BGP
> path selection, etc. This is the preferred approach, as opposed to
> simply mirroring routing state from the Primary to the Backup,
> because it eliminates fate sharing: if there is a bug on the
> Primary RE, we don't want to create a carbon copy of it on the Backup.

I don't really buy that argument. Running the same code with the same
algorithm against the same data usually leads to the same results.
You'll get full bug redundancy - I'd expect the REs to crash
simultaneously. Did NSR protect from any of the recent BGP bugs?

The advantages I see are less impactful failovers in case of a)
hardware failure of the active RE, or b) data structure corruption
happening on both REs [same code => same bugs] but eventually leading
to a crash of the active RE sooner than of the backup RE, or c) race
conditions being triggered sufficiently differently timing-wise that
only the active RE crashes.

Am I missing something?

Best regards,
Daniel

--
CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0
Re: [j-nsp] Random BGP peer drops
Hi,

http://www.gossamer-threads.com/lists/nsp/juniper/32538#32538

So either solve the load/transmission issues or upgrade the RR to 10.4.

On 14.02.2012 20:55, Serge Vautour wrote:
> Hello,
>
> We have an MPLS network made up of many MX960s and MX80s. We run OSPF
> as our IGP - all links in area 0. BGP is used for signaling of all
> L2VPN & VPLS. At this time we only have 1 L3VPN for mgmt. LDP is used
> for transport LSPs. We have M10i as dedicated Route Reflectors. Most
> MX are on 10.4S5. The M10i are still on 10.0R3. Each PE peers with 2
> RRs and has 2 diverse uplinks for redundancy. If 1 link fails, there's
> always another path.
>
> It's been rare but we've seen random iBGP peer drops. The first was
> several months ago. We've now seen 2 in the last week. 2 of the 3 were
> related to link failures. The primary path from the PE to the RR
> failed. BGP timed out after a bit. Here's an example:
>
> Feb 8 14:05:32 OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: ifIndex 129, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-7/0/0
> Feb 8 14:05:32 OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: ifIndex 120, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-0/0/0
> Feb 8 14:06:33 OURBOX-re0 rpd[1413]: %DAEMON-4: bgp_hold_timeout:3660: NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 1056225956 snd_nxt: 1056225956 snd_wnd: 16384 rcv_nxt: 3883304584 rcv_adv: 3883320968, hold timer 0
>
> BGP holdtime is 90sec. This is more than enough time for OSPF to find
> the other path and converge. The BGP peer came back up before the
> link, so things did eventually converge.
>
> The last BGP peer drop happened without any link failure. Out of the
> blue, BGP just went down. The logs on the PE:
>
> Feb 13 20:40:48 OUR-PE1 rpd[1159]: %DAEMON-4: bgp_hold_timeout:3660: NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 2149021074 snd_nxt: 2149021074 snd_wnd: 16384 rcv_nxt: 2049196833 rcv_adv: 2049213217, hold timer 0
> Feb 13 20:40:48 OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.2 (Internal AS 123) changed state from Established to Idle (event HoldTime)
> Feb 13 20:41:21 OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.2 (Internal AS 123) changed state from OpenConfirm to Established (event RecvKeepAlive)
>
> The RR side shows the same:
>
> Feb 13 20:40:49 OUR-RR1-re0 rpd[1187]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 123) changed state from Established to Idle (event RecvNotify)
> Feb 13 20:40:49 OUR-RR1-re0 rpd[1187]: %DAEMON-4: bgp_read_v4_message:8927: NOTIFICATION received from 10.1.1.61 (Internal AS 123): code 4 (Hold Timer Expired Error), socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 2049196833 snd_nxt: 2049196871 snd_wnd: 16384 rcv_nxt: 2149021095 rcv_adv: 2149037458, hold timer 1:03.112744
> Feb 13 20:41:21 OUR-RR1-re0 rpd[1187]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 123) changed state from EstabSync to Established (event RsyncAck)
> Feb 13 20:41:30 OUR-RR1-re0 rpd[1187]: %DAEMON-3: bgp_send: sending 30 bytes to 10.1.1.61 (Internal AS 123) blocked (no spooling requested): Resource temporarily unavailable
>
> You can see the peer wasn't down long and re-established on its own.
> The logs on the RR make it look like it received a msg from the PE
> saying it was dropping the BGP session. The last error on the RR seems
> odd as well.
>
> Has anyone seen something like this before? We do have a case open
> regarding a large number of LSA retransmits. TAC is saying this is a
> bug related to NSR but shouldn't cause any negative impacts. I'm not
> sure if this is related. I'm considering opening a case for this as
> well, but I'm not very confident I'll get far.
>
> Any help would be appreciated.
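For readers unfamiliar with the knob in question: the 90-second hold time discussed above is configured per BGP group or neighbor. A minimal sketch of how such an iBGP group might look (group name is hypothetical; addresses are taken from the logs, not from the poster's actual config):

```
protocols {
    bgp {
        group RR-CLIENTS {              /* hypothetical group name */
            type internal;
            local-address 10.1.1.61;    /* PE loopback seen in the logs */
            hold-time 90;               /* keepalives sent at 1/3 of this */
            neighbor 10.1.1.2;          /* route reflector */
        }
    }
}
```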
Re: [j-nsp] MX960 Redundant RE problem
Morgan,

You are correct if you are running GRES only; however, if you enable NSR the
Backup RE also actively runs rpd and maintains adjacency state, etc., so in
the event of a Primary RE failure you will not need to reestablish
adjacencies. The cool thing is the Backup RE is actually listening to all the
control plane messages coming in on fxp1 destined for the Master RE and
formulating its own decisions, running its own Dijkstra, BGP path selection,
etc. This is the preferred approach, as opposed to simply mirroring routing
state from the Primary to the Backup, because it eliminates fate sharing: if
there is a bug on the Primary RE, we don't want to create a carbon copy of it
on the Backup.

Stefan Fouant
JNCIE-SEC, JNCIE-SP, JNCIE-ER, JNCI
Technical Trainer, Juniper Networks
Follow us on Twitter @JuniperEducate

Sent from my iPad

On Feb 15, 2012, at 2:56 AM, Morgan McLean wrote:

> Correct me if I'm wrong, but backup routing engines never have adjacencies
> or peering relationships etc. because they are not active, correct? When
> they become master they have to reestablish those sessions. That's how it
> seems to be for our SRX routing engines, at least, but routes are shared
> between the two so that during the time it takes for those things to
> reestablish, the routes are still moving traffic.
>
> I might be wrong, but that was my impression.
>
> Morgan
>
> 2012/2/14 Mohammad
>
>> Hi everyone
>>
>> We have an MX960 with two routing engines, re0: Backup, re1: Master.
>>
>> When we try to switch over to the backup RE we see the following message:
>>
>> XXX# run request chassis routing-engine master switch
>> error: Standby Routing Engine is not ready for graceful switchover
>> (replication_err soft_mask_err)
>> Toggle mastership between routing engines ? [yes,no] (no)
>>
>> Note that we used to switch over between the two REs a day before with no
>> issues.
>>
>> Also, when we log in to re0 (backup) and check ISIS, RSVP, etc. we see the
>> following:
>>
>> XXX> request routing-engine login other-routing-engine
>>
>> --- JUNOS 10.2R3.10 built 2010-10-16 19:24:06 UTC
>>
>> {backup}
>> XXX> show isis adjacency
>>
>> {backup}
>> XXX> show rsvp session
>> Ingress RSVP: 0 sessions
>> Total 0 displayed, Up 0, Down 0
>>
>> Egress RSVP: 0 sessions
>> Total 0 displayed, Up 0, Down 0
>>
>> Transit RSVP: 0 sessions
>> Total 0 displayed, Up 0, Down 0
>>
>> {backup}
>> XXX>
>>
>> ...while we can still see the BGP routes and L3VPN routes.
>>
>> We have tried replacing the backup with another RE, but with the same
>> results.
>>
>> Any ideas? This issue is really confusing us, and it is a very critical
>> router in our network.
>>
>> Thank you in advance
>> Mohammad Salbad
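For anyone following along, the behavior Stefan describes requires both GRES and NSR to be enabled; a minimal sketch of the relevant stanzas (assuming dual REs):

```
system {
    commit synchronize;          /* keep both REs on the same configuration */
}
chassis {
    redundancy {
        graceful-switchover;     /* GRES: kernel and interface state sync */
    }
}
routing-options {
    nonstop-routing;             /* NSR: backup RE runs its own rpd */
}
```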
Re: [j-nsp] flexible ethernet services / pppoe
Thanks... I may have actually found a "better" way to do this. Is there any
reason this wouldn't work?

paul@dis1.beachburg1# show
description "Wireless Network Trunk";
vlan-tagging;
encapsulation flexible-ethernet-services;
unit 400 {
    description Wireless_Public_DHCP;
    encapsulation vlan-bridge;
    vlan-id 400;
    family bridge;
}
unit 401 {
    description Wireless_Private_Management;
    encapsulation vlan-bridge;
    vlan-id 401;
    family bridge;
}
unit 402 {
    description Wireless_PPPOE;
    vlan-id 402;
    family pppoe {
        dynamic-profile PPPOE;
    }
}

Thanks again,
Paul

-----Original Message-----
From: Per Granath [mailto:per.gran...@gcc.com.cy]
Sent: February-15-12 1:54 AM
To: Paul Stewart; juniper-nsp@puck.nether.net
Subject: RE: [j-nsp] flexible ethernet services / pppoe

> I'm trying to work with an interface that has mixed subinterfaces. Some of
> the subinterfaces are part of a bridge domain, some are family inet, and
> one interface is PPPoE for subscriber termination.
>
> unit 402 {
>     description Wireless_PPPOE;
>     encapsulation ppp-over-ether;
>     vlan-id 402;
>     pppoe-underlying-options {
>         duplicate-protection;
>         dynamic-profile PPPOE;
>     }
> }
>
> paul@dis1.beachburg1# commit check
> [edit interfaces ge-1/2/8]
>   'unit 402'
>     Link encapsulation type is not valid for device type
> error: configuration check-out failed

Try this:

[edit interfaces]
demux0 {
    unit 402 {
        proxy-arp;
        vlan-id 402;
        demux-options {
            underlying-interface ge-1/2/8;
        }
        family pppoe {
            duplicate-protection;
            dynamic-profile PPPOE;
        }
    }
}

(and remove the other pppoe unit from the physical interface)
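If the simplified unit 402 commits cleanly, a few operational commands can confirm PPPoE subscribers are actually terminating (a sketch; exact output varies by release and subscriber-management licensing):

```
paul@dis1.beachburg1> show pppoe interfaces
paul@dis1.beachburg1> show pppoe statistics
paul@dis1.beachburg1> show subscribers summary
```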
Re: [j-nsp] MX960 Redundant RE problem
You can also run the following command on the backup RE to check its state:

me@BLAH-re1> show system switchover
Graceful switchover: On
Configuration database: Ready
Kernel database: Ready
Peer state: Steady State

If this command and "show task replication" on the master RE don't show the
correct outputs, I agree with the recommendation to turn GRES/NSR off and on.
If that doesn't work, reboot the REs.

Serge

From: Mohammad
To: juniper-nsp@puck.nether.net
Sent: Wednesday, February 15, 2012 6:44:42 AM
Subject: Re: [j-nsp] MX960 Redundant RE problem

Kindly find the following output, I hope it is helpful:

x> show task replication
        Stateful Replication: Enabled
        RE mode: Master

    Protocol        Synchronization Status
    OSPF            Complete
    BGP             Complete
    IS-IS           Complete
    MPLS            Complete
    RSVP            Complete

{master}
Re: [j-nsp] Random BGP peer drops
Our NMS gets CPU & memory usage on both REs on the RR every 5 minutes. The
graphs don't show anything abnormal: CPU usage is <5% on both REs and memory
is <25%.

Serge

From: David Ball
To: Serge Vautour
Cc: "juniper-nsp@puck.nether.net"
Sent: Tuesday, February 14, 2012 4:47:41 PM
Subject: Re: [j-nsp] Random BGP peer drops

I saw something similar on a T-series with 2 REs running 10.0, and it was
related to an NSR bug that was causing the backup RE to thrash and push CPU
through the roof on the primary. I also recall a mib2d bug resulting in high
CPU, though I'm sure you would have noticed in either case.

David

On 14 February 2012 15:31, Serge Vautour wrote:
> Yes. That was the first thing we checked. I should've mentioned that.
>
> Serge
>
> From: "sth...@nethelp.no"
> To: se...@nbnet.nb.ca; sergevaut...@yahoo.ca
> Cc: juniper-nsp@puck.nether.net
> Sent: Tuesday, February 14, 2012 3:41:02 PM
> Subject: Re: [j-nsp] Random BGP peer drops
>
>> It's been rare but we've seen random iBGP peer drops. The first was
>> several months ago. We've now seen 2 in the last week.
>
> Have you verified that you have a consistent MTU throughout your net?
>
> Steinar Haug, Nethelp consulting, sth...@nethelp.no
Re: [j-nsp] Random BGP peer drops
We do. It's standard on all our interfaces:

myuser@MYPE1-re0> show configuration protocols ospf area 0 interface xe-0/0/0
interface-type p2p;
metric 100;
ldp-synchronization;

Serge

From: Addy Mathur
To: Serge Vautour
Cc: "juniper-nsp@puck.nether.net"
Sent: Wednesday, February 15, 2012 10:54:29 AM
Subject: Re: [j-nsp] Random BGP peer drops

Serge:

Do you have ldp-synchronization enabled?

http://www.juniper.net/techpubs/en_US/junos10.4/topics/usage-guidelines/routing-configuring-synchronization-between-ldp-and-igps.html

--Addy.

On Tuesday, February 14, 2012, Serge Vautour wrote:
> Hello,
>
> We have an MPLS network made up of many MX960s and MX80s. We run OSPF as
> our IGP - all links in area 0. BGP is used for signaling of all L2VPN &
> VPLS. At this time we only have 1 L3VPN for mgmt. LDP is used for transport
> LSPs. We have M10i as dedicated Route Reflectors. Most MX are on 10.4S5.
> The M10i are still on 10.0R3. Each PE peers with 2 RRs and has 2 diverse
> uplinks for redundancy. If 1 link fails, there's always another path.
>
> It's been rare but we've seen random iBGP peer drops. The first was several
> months ago. We've now seen 2 in the last week. 2 of the 3 were related to
> link failures. The primary path from the PE to the RR failed. BGP timed out
> after a bit. Here's an example:
>
> Feb 8 14:05:32 OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: ifIndex 129, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-7/0/0
> Feb 8 14:05:32 OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: ifIndex 120, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-0/0/0
> Feb 8 14:06:33 OURBOX-re0 rpd[1413]: %DAEMON-4: bgp_hold_timeout:3660: NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 1056225956 snd_nxt: 1056225956 snd_wnd: 16384 rcv_nxt: 3883304584 rcv_adv: 3883320968, hold timer 0
>
> BGP holdtime is 90sec. This is more than enough time for OSPF to find the
> other path and converge. The BGP peer came back up before the link, so
> things did eventually converge.
>
> The last BGP peer drop happened without any link failure. Out of the blue,
> BGP just went down. The logs on the PE:
>
> Feb 13 20:40:48 OUR-PE1 rpd[1159]: %DAEMON-4: bgp_hold_timeout:3660: NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 2149021074 snd_nxt: 2149021074 snd_wnd: 16384 rcv_nxt: 2049196833 rcv_adv: 2049213217, hold timer 0
> Feb 13 20:40:48 OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.2 (Internal AS 123) changed state from Established to Idle (event HoldTime)
> Feb 13 20:41:21 OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.2 (Internal AS 123) changed state from OpenConfirm to Established (event RecvKeepAlive)
>
> The RR side shows the same:
>
> Feb 13 20:40:49 OUR-RR1-re0 rpd[1187]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 123) changed state from Established to Idle (event RecvNotify)
> Feb 13 20:40:49 OUR-RR1-re0 rpd[1187]: %DAEMON-4: bgp_read_v4_message:8927: NOTIFICATION received from 10.1.1.61 (Internal AS 123): code 4 (Hold Timer Expired Error), socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 2049196833 snd_nxt: 2049196871 snd_wnd: 16384 rcv_nxt: 2149021095 rcv_adv: 2149037458, hold timer 1:03.112744
> Feb 13 20:41:21 OUR-RR1-re0 rpd[1187]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 123) changed state from EstabSync to Established (event RsyncAck)
> Feb 13 20:41:30 OUR-RR1-re0 rpd[1187]: %DAEMON-3: bgp_send: sending 30 bytes to 10.1.1.61 (Internal AS 123) blocked (no spooling requested): Resource temporarily unavailable
>
> You can see the peer wasn't down long and re-established on its own. The
> logs on the RR make it look like it received a msg from the PE saying it
> was dropping the BGP session. The last error on the RR seems odd as well.
>
> Has anyone seen something like this before? We do have a case open
> regarding a large number of LSA retransmits. TAC is saying this is a bug
> related to NSR but shouldn't cause any negative impacts. I'm not sure if
> this is related. I'm considering opening a case for this as well, but I'm
> not very confident I'll get far.
>
> Any help would be appreciated.
>
> Thanks,
> Serge
Re: [j-nsp] Random BGP peer drops
Serge:

Do you have ldp-synchronization enabled?

http://www.juniper.net/techpubs/en_US/junos10.4/topics/usage-guidelines/routing-configuring-synchronization-between-ldp-and-igps.html

--Addy.

On Tuesday, February 14, 2012, Serge Vautour wrote:
> Hello,
>
> We have an MPLS network made up of many MX960s and MX80s. We run OSPF as
> our IGP - all links in area 0. BGP is used for signaling of all L2VPN &
> VPLS. At this time we only have 1 L3VPN for mgmt. LDP is used for transport
> LSPs. We have M10i as dedicated Route Reflectors. Most MX are on 10.4S5.
> The M10i are still on 10.0R3. Each PE peers with 2 RRs and has 2 diverse
> uplinks for redundancy. If 1 link fails, there's always another path.
>
> It's been rare but we've seen random iBGP peer drops. The first was several
> months ago. We've now seen 2 in the last week. 2 of the 3 were related to
> link failures. The primary path from the PE to the RR failed. BGP timed out
> after a bit. Here's an example:
>
> Feb 8 14:05:32 OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: ifIndex 129, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-7/0/0
> Feb 8 14:05:32 OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: ifIndex 120, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-0/0/0
> Feb 8 14:06:33 OURBOX-re0 rpd[1413]: %DAEMON-4: bgp_hold_timeout:3660: NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 1056225956 snd_nxt: 1056225956 snd_wnd: 16384 rcv_nxt: 3883304584 rcv_adv: 3883320968, hold timer 0
>
> BGP holdtime is 90sec. This is more than enough time for OSPF to find the
> other path and converge. The BGP peer came back up before the link, so
> things did eventually converge.
>
> The last BGP peer drop happened without any link failure. Out of the blue,
> BGP just went down. The logs on the PE:
>
> Feb 13 20:40:48 OUR-PE1 rpd[1159]: %DAEMON-4: bgp_hold_timeout:3660: NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 2149021074 snd_nxt: 2149021074 snd_wnd: 16384 rcv_nxt: 2049196833 rcv_adv: 2049213217, hold timer 0
> Feb 13 20:40:48 OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.2 (Internal AS 123) changed state from Established to Idle (event HoldTime)
> Feb 13 20:41:21 OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.2 (Internal AS 123) changed state from OpenConfirm to Established (event RecvKeepAlive)
>
> The RR side shows the same:
>
> Feb 13 20:40:49 OUR-RR1-re0 rpd[1187]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 123) changed state from Established to Idle (event RecvNotify)
> Feb 13 20:40:49 OUR-RR1-re0 rpd[1187]: %DAEMON-4: bgp_read_v4_message:8927: NOTIFICATION received from 10.1.1.61 (Internal AS 123): code 4 (Hold Timer Expired Error), socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 2049196833 snd_nxt: 2049196871 snd_wnd: 16384 rcv_nxt: 2149021095 rcv_adv: 2149037458, hold timer 1:03.112744
> Feb 13 20:41:21 OUR-RR1-re0 rpd[1187]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 123) changed state from EstabSync to Established (event RsyncAck)
> Feb 13 20:41:30 OUR-RR1-re0 rpd[1187]: %DAEMON-3: bgp_send: sending 30 bytes to 10.1.1.61 (Internal AS 123) blocked (no spooling requested): Resource temporarily unavailable
>
> You can see the peer wasn't down long and re-established on its own. The
> logs on the RR make it look like it received a msg from the PE saying it
> was dropping the BGP session. The last error on the RR seems odd as well.
>
> Has anyone seen something like this before? We do have a case open
> regarding a large number of LSA retransmits. TAC is saying this is a bug
> related to NSR but shouldn't cause any negative impacts. I'm not sure if
> this is related. I'm considering opening a case for this as well, but I'm
> not very confident I'll get far.
>
> Any help would be appreciated.
>
> Thanks,
> Serge
Re: [j-nsp] WAN-PHY support for EX-series 10g interfaces
LAN-PHY only on EX4200/4500 as far as I know.

--
Tim

On Tue, Feb 14, 2012 at 11:53 PM, Dale Shaw wrote:
> Hi,
>
> Potentially odd question here, but does anyone know, from first-hand
> experience, whether WAN-PHY mode is supported on 10G interfaces in
> EX-series devices? Specifically the EX4200 and/or EX4500?
>
> I ask because we have a new carrier circuit being delivered in the
> not-too-distant future and we need to plug something into it to test it.
> Eventually we'll jam an SRX5800 with a 4x10GE DPC onto the end of it, but
> in the meantime it would be handy to terminate and test with something...
> smaller.
>
> The existing interfaces we have (provisioned before my time) apparently
> needed to be configured with the "framing wan-phy" and "optics-options
> wavelength 1550.12" configuration options. The framing command
> auto-completes on an EX-series box, but the optics-options command is
> hidden.
>
> Couldn't find any definitive references in the product docs.
>
> cheers,
> Dale
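For reference, the two knobs Dale mentions look roughly like this on platforms that do support them (interface name is illustrative; as noted, optics-options is hidden on EX-series):

```
interfaces {
    xe-1/0/0 {
        framing {
            wan-phy;                 /* SONET-style framing instead of LAN-PHY */
        }
        optics-options {
            wavelength 1550.12;      /* DWDM wavelength in nm */
        }
    }
}
```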
Re: [j-nsp] SCB-E
PR718485:

Workaround: disable the "then log" or "then syslog" actions in the firewall
configuration.

On Wednesday, 15.02.2012, 12:28 +0100, Per Randrup Nielsen wrote:
> PR718485
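In practice the workaround amounts to stripping the logging actions out of the affected filter terms; a sketch (filter and term names are hypothetical):

```
firewall {
    family inet {
        filter protect-re {
            term accept-bgp {
                from {
                    protocol tcp;
                    port bgp;
                }
                then {
                    /* log and syslog actions removed per the PR718485 workaround */
                    accept;
                }
            }
        }
    }
}
```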
Re: [j-nsp] SCB-E
What workaround is available?

/Per

-----Original Message-----
From: juniper-nsp-boun...@puck.nether.net [mailto:juniper-nsp-boun...@puck.nether.net] On behalf of Frank Blankman
Sent: 13 February 2012 18:02
To: David Temkin
Cc: juniper-nsp@puck.nether.net
Subject: Re: [j-nsp] SCB-E

We've seen PR718485 get hit on a 960 with SCB-Es, some 16x10G MPCs and
11.4R1, though no recurring hits after applying the workaround. Running a
config close to Dave's.

Frank

On Feb 13, 2012, at 3:26 PM, David Temkin wrote:

> Not horrible, but similar results in a box with dual SCB-Es and 2 16x10G
> MPCs:
>
> $ time snmpbulkwalk -v2c -c # x.x.x.x ifHCInOctets > /dev/null
>
> real    0m6.262s
> user    0m0.028s
> sys     0m0.018s
>
> So it's usable - and I haven't hit any other showstopper bugs thus far -
> but I'm running purely IP (two full tables + other associated peers).
>
> -Dave
>
> On 2/8/12 6:10 AM, david@orange.com wrote:
>> Hi,
>>
>> Same results on my side...
>>
>> Just a precision, to be sure there is no misunderstanding: it's not
>> triggered by the SCB-E, it is a software issue... but currently we only
>> have this release to play with the SCB-E :-)
>>
>> Regards
>> David
>>
>> David Roy
>> IP/MPLS Support engineer - Orange France
>> Ph. +33 2 99 87 64 72 - Mob. +33 6 85 52 22 13
>> david@orange.com
>>
>> JNCIE-M&T/SP #703
>> JNCIP-ENT
>>
>> -----Original Message-----
>> From: juniper-nsp-boun...@puck.nether.net
>> [mailto:juniper-nsp-boun...@puck.nether.net] On behalf of Daniel Roesen
>> Sent: Wednesday, 8 February 2012 07:44
>> To: juniper-nsp@puck.nether.net
>> Subject: Re: [j-nsp] SCB-E
>>
>> On Wed, Feb 08, 2012 at 01:23:11AM +, OBrien, Will wrote:
>>> Anyone running the SCB-E? I've got a stack of them with a set of fresh
>>> MX480s ready to roll out. I'm curious what code you're running.
>>
>> Given that there is only one public JUNOS release which supports the
>> SCB-E, there aren't many options: 11.4R1 - and that one has unusable SNMP
>> due to newly introduced PFE statistics request delays (feature, not bug,
>> of course!):
>>
>> foo@lab-MX960> show snmp mib walk ifHCOutOctets | count
>> Count: 413 lines
>>
>> $ time snmpbulkwalk -v2c -c removed x.x.x.x ifHCInOctets > /dev/null
>> Timeout: No Response from x.x.x.x
>>
>> real    0m26.380s
>> user    0m1.647s
>> sys     0m0.133s
>>
>> PR/731833 - fix supposed to come in 11.4R3, slated for May.
>>
>> So as far as things stand, the SCB-E is not deployable before mid-2012 at
>> the earliest, if (and that's a big "if" when looking at the 10.4
>> experience) 11.4R3 is going to be usable.
>>
>> Ah, and 11.4R1 floods your log with messages like:
>>
>> mcsn[91713]: %DAEMON-6: krt_decode_nexthop: Try freeing: nh-handle: 0x0 nh-index: 1049083 fwdtype: 3
>>
>> No idea whether that's service affecting - we haven't observed any impact
>> due to that yet.
>>
>> Best regards,
>> Daniel
>>
>> --
>> CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0
Re: [j-nsp] MX960 Redundant RE problem
Kindly find the following output, I hope it is helpful:

x> show task replication
        Stateful Replication: Enabled
        RE mode: Master

    Protocol        Synchronization Status
    OSPF            Complete
    BGP             Complete
    IS-IS           Complete
    MPLS            Complete
    RSVP            Complete

{master}
Re: [j-nsp] MX960 Redundant RE problem
You have GRES enabled and the backup RE was not ready to take over. See the
message in the first lines.

Thanks

On 2/15/12, Mohammad wrote:
> Hi everyone
>
> We have an MX960 with two routing engines, re0: Backup, re1: Master.
>
> When we try to switch over to the backup RE we see the following message:
>
> XXX# run request chassis routing-engine master switch
> error: Standby Routing Engine is not ready for graceful switchover
> (replication_err soft_mask_err)
> Toggle mastership between routing engines ? [yes,no] (no)
>
> Note that we used to switch over between the two REs a day before with no
> issues.
>
> Also, when we log in to re0 (backup) and check ISIS, RSVP, etc. we see the
> following:
>
> XXX> request routing-engine login other-routing-engine
>
> --- JUNOS 10.2R3.10 built 2010-10-16 19:24:06 UTC
>
> {backup}
> XXX> show isis adjacency
>
> {backup}
> XXX> show rsvp session
> Ingress RSVP: 0 sessions
> Total 0 displayed, Up 0, Down 0
>
> Egress RSVP: 0 sessions
> Total 0 displayed, Up 0, Down 0
>
> Transit RSVP: 0 sessions
> Total 0 displayed, Up 0, Down 0
>
> {backup}
> XXX>
>
> ...while we can still see the BGP routes and L3VPN routes.
>
> We have tried replacing the backup with another RE, but with the same
> results.
>
> Any ideas? This issue is really confusing us, and it is a very critical
> router in our network.
>
> Thank you in advance
> Mohammad Salbad

--
Sent from my mobile device

./diogo -montagner
Re: [j-nsp] MX960 Redundant RE problem
> We have an MX960 with two routing engines, re0: Backup, re1: Master.
>
> When we try to switch over to the backup RE we see the following message:
>
> XXX# run request chassis routing-engine master switch
> error: Standby Routing Engine is not ready for graceful switchover
> (replication_err soft_mask_err)

Disable graceful-switchover (and nonstop-routing) and then commit (assuming
commit synchronize is configured). Then enable them again, commit, and wait
for the REs to sync. Possibly something with the kernel database not being
healthy.

...or try JTAC :)
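The off/on toggle described above might look like this from configuration mode (a sketch; since NSR depends on GRES, deactivate NSR first and reactivate it last):

```
{master}[edit]
user@XXX# deactivate routing-options nonstop-routing
user@XXX# deactivate chassis redundancy graceful-switchover
user@XXX# commit synchronize

/* wait for both REs to settle, then re-enable */
user@XXX# activate chassis redundancy graceful-switchover
user@XXX# activate routing-options nonstop-routing
user@XXX# commit synchronize

/* then verify readiness on the backup RE */
user@XXX> show system switchover
```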