Re: [j-nsp] MX960 Redundant RE problem

2012-02-15 Thread Stefan Fouant
I was referring more to a bug in hardware... Bad memory, etc.

Stefan Fouant
JNCIE-SEC, JNCIE-SP, JNCIE-ER, JNCI
Technical Trainer, Juniper Networks

Follow us on Twitter @JuniperEducate

Sent from my iPad

On Feb 15, 2012, at 1:56 PM, Daniel Roesen  wrote:

> On Wed, Feb 15, 2012 at 12:24:50PM -0500, Stefan Fouant wrote:
>> The cool thing is that the Backup RE is actually listening to all
>> the control plane messages coming in on fxp1 destined for the
>> Master RE and forming its own decisions, running its own Dijkstra,
>> BGP path selection, etc. This approach is preferred over simply
>> mirroring routing state from the Primary to the Backup because it
>> eliminates fate sharing: if there is a bug on the Primary RE, we
>> don't want to create a carbon copy of it on the Backup.
> 
> I don't really buy that argument. Running the same code with the same
> algorithm against the same data usually leads to the same results.
> You'll get full bug redundancy - I'd expect both REs to crash
> simultaneously. Did NSR protect from any of the recent BGP bugs?
> 
> The advantages I see are less-impactful failovers in case of a) hardware
> failures of the active RE, or b) data structure corruption happening on
> both REs [same code => same bugs] but eventually leading to a crash of
> the active RE sooner than on the backup RE, or c) race conditions being
> triggered with sufficiently different timing so that only the active RE
> crashes.
> 
> Am I missing something?
> 
> Best regards,
> Daniel
> 
> -- 
> CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0
___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] MX960 Redundant RE problem

2012-02-15 Thread Joel jaeggli
On 2/15/12 10:56 , Daniel Roesen wrote:
> On Wed, Feb 15, 2012 at 12:24:50PM -0500, Stefan Fouant wrote:
>> The cool thing is that the Backup RE is actually listening to all
>> the control plane messages coming in on fxp1 destined for the
>> Master RE and forming its own decisions, running its own Dijkstra,
>> BGP path selection, etc. This approach is preferred over simply
>> mirroring routing state from the Primary to the Backup because it
>> eliminates fate sharing: if there is a bug on the Primary RE, we
>> don't want to create a carbon copy of it on the Backup.
> 
> I don't really buy that argument. Running the same code with the same
> algorithm against the same data usually leads to the same results.
> You'll get full bug redundancy - I'd expect both REs to crash
> simultaneously. Did NSR protect from any of the recent BGP bugs?
> 
> The advantages I see are less-impactful failovers in case of a) hardware
> failures of the active RE, or b) data structure corruption happening on
> both REs [same code => same bugs] but eventually leading to a crash of
> the active RE sooner than on the backup RE, or c) race conditions being
> triggered with sufficiently different timing so that only the active RE
> crashes.

When ISSU actually works, it's a godsend.

> Am I missing something?
> 
> Best regards,
> Daniel
> 

___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] MX960 Redundant RE problem

2012-02-15 Thread Daniel Roesen
On Wed, Feb 15, 2012 at 12:24:50PM -0500, Stefan Fouant wrote:
> The cool thing is that the Backup RE is actually listening to all the
> control plane messages coming in on fxp1 destined for the Master RE
> and forming its own decisions, running its own Dijkstra, BGP path
> selection, etc. This approach is preferred over simply mirroring
> routing state from the Primary to the Backup because it eliminates
> fate sharing: if there is a bug on the Primary RE, we don't want to
> create a carbon copy of it on the Backup.

I don't really buy that argument. Running the same code with the same
algorithm against the same data usually leads to the same results.
You'll get full bug redundancy - I'd expect both REs to crash
simultaneously. Did NSR protect from any of the recent BGP bugs?

The advantages I see are less-impactful failovers in case of a) hardware
failures of the active RE, or b) data structure corruption happening on
both REs [same code => same bugs] but eventually leading to a crash of
the active RE sooner than on the backup RE, or c) race conditions being
triggered with sufficiently different timing so that only the active RE
crashes.

Am I missing something?

Best regards,
Daniel

-- 
CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0
___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] Random BGP peer drops

2012-02-15 Thread Tima Maryin

Hi,


http://www.gossamer-threads.com/lists/nsp/juniper/32538#32538

So either solve the load/transmission issues or upgrade the RR to 10.4



On 14.02.2012 20:55, Serge Vautour wrote:

Hello,

We have an MPLS network made up of many MX960s and MX80s. We run OSPF as our IGP - 
all links in area 0. BGP is used for signaling of all L2VPN & VPLS. At this 
time we only have 1 L3VPN for mgmt. LDP is used for transport LSPs. We have 
M10i as dedicated Route Reflectors. Most MX are on 10.4S5. M10i still on 10.0R3. 
Each PE peers with 2 RRs and has 2 diverse uplinks for redundancy. If 1 link fails, 
there's always another path.

It's been rare but we've seen random iBGP peer drops. The first was several 
months ago. We've now seen 2 in the last week. 2 of the 3 were related to link 
failures. The primary path from the PE to the RR failed. BGP timed out after a 
bit. Here's an example:

Feb  8 14:05:32  OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: ifIndex 
129, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-7/0/0
Feb  8 14:05:32  OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: ifIndex 
120, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-0/0/0
Feb  8 14:06:33  OURBOX-re0 rpd[1413]: %DAEMON-4: bgp_hold_timeout:3660: 
NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired 
Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket buffer 
sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 1056225956 snd_nxt: 1056225956 
snd_wnd: 16384 rcv_nxt: 3883304584 rcv_adv: 3883320968, hold timer 0

BGP holdtime is 90sec. This is more than enough time for OSPF to find the other 
path and converge. The BGP peer came back up before the link so things did 
eventually converge.
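
For reference, the hold time is just the usual knob under the BGP group; a
sketch of where it lives (the group name IBGP-RR is a placeholder, not our
actual config):

set protocols bgp group IBGP-RR type internal
set protocols bgp group IBGP-RR hold-time 90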

The last BGP peer drop happened without any link failure. Out of the blue, BGP 
just went down. The logs on the PE:

Feb 13 20:40:48  OUR-PE1 rpd[1159]: %DAEMON-4: bgp_hold_timeout:3660: 
NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired 
Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket buffer 
sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 2149021074 snd_nxt: 2149021074 
snd_wnd: 16384 rcv_nxt: 2049196833 rcv_adv: 2049213217, hold timer 0
Feb 13 20:40:48  OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: 
BGP peer 10.1.1.2 (Internal AS 123) changed state from Established to Idle 
(event HoldTime)
Feb 13 20:41:21  OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: 
BGP peer 10.1.1.2 (Internal AS 123) changed state from OpenConfirm to 
Established (event RecvKeepAlive)

The RR side shows the same:

Feb 13 20:40:49  OUR-RR1-re0 rpd[1187]: 
%DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 123) 
changed state from Established to Idle (event RecvNotify)
Feb 13 20:40:49  OUR-RR1-re0 rpd[1187]: %DAEMON-4: bgp_read_v4_message:8927: 
NOTIFICATION received from 10.1.1.61 (Internal AS 123): code 4 (Hold Timer 
Expired Error), socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 
2049196833 snd_nxt: 2049196871 snd_wnd: 16384 rcv_nxt: 2149021095 rcv_adv: 
2149037458, hold timer 1:03.112744
Feb 13 20:41:21  OUR-RR1-re0 rpd[1187]: 
%DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 123) 
changed state from EstabSync to Established (event RsyncAck)
Feb 13 20:41:30  OUR-RR1-re0 rpd[1187]: %DAEMON-3: bgp_send: sending 30 bytes 
to 10.1.1.61 (Internal AS 123) blocked (no spooling requested): Resource 
temporarily unavailable


You can see the peer wasn't down long and re-established on its own. The logs 
on the RR make it look like it received a msg from the PE that it was dropping 
the BGP session. The last error on the RR seems odd as well.


Has anyone seen something like this before? We do have a case open regarding a 
large number of LSA retransmits. TAC is saying this is a bug related to NSR but 
shouldn't cause any negative impacts. I'm not sure if this is related. I'm 
considering opening a case for this as well but I'm not very confident I'll get 
far.


Any help would be appreciated.

___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] MX960 Redundant RE problem

2012-02-15 Thread Stefan Fouant
Morgan,

You are correct if you are running GRES only. However, if you enable NSR, the 
Backup RE also actively runs rpd and maintains state, adjacencies, etc., so in 
the event of a Primary RE failure you will not need to reestablish them.

The cool thing is that the Backup RE is actually listening to all the control 
plane messages coming in on fxp1 destined for the Master RE and forming its own 
decisions, running its own Dijkstra, BGP path selection, etc. This approach is 
preferred over simply mirroring routing state from the Primary to the Backup 
because it eliminates fate sharing: if there is a bug on the Primary RE, we 
don't want to create a carbon copy of it on the Backup.
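
For anyone who wants to reproduce this, a minimal sketch of the knobs involved
(set-style commands; nothing assumed beyond a dual-RE chassis):

set system commit synchronize
set chassis redundancy graceful-switchover
set routing-options nonstop-routing

Once that is committed, "show task replication" on the master and "show system
switchover" on the backup should report the replication/readiness state.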

Stefan Fouant
JNCIE-SEC, JNCIE-SP, JNCIE-ER, JNCI
Technical Trainer, Juniper Networks

Follow us on Twitter @JuniperEducate

Sent from my iPad

On Feb 15, 2012, at 2:56 AM, Morgan McLean  wrote:

> Correct me if I'm wrong, but backup routing engines never have adjacencies
> or peering relationships etc because they are not active, correct? When
> they become master they have to reestablish those sessions. That's how it
> seems to be for our SRX routing engines, at least, but routes are shared
> between the two so that during the time it takes for those things to
> reestablish, the routes are still moving traffic.
> 
> I might be wrong, but that was my impression.
> 
> Morgan
> 
> 2012/2/14 Mohammad 
> 
>> Hi everyone
>> 
>> 
>> 
>> We have an MX960 with two routing engines, Re0: Backup, Re1: Master
>> 
>> When we try to switchover to the backup RE we see the following message:
>> 
>> XXX# run request chassis routing-engine master switch
>> 
>> error: Standby Routing Engine is not ready for graceful switchover
>> (replication_err soft_mask_err)
>> 
>> Toggle mastership between routing engines ? [yes,no] (no)
>> 
>> Noting that we used to switch over between the two REs a day before with
>> no issues
>> 
>> 
>> 
>> Also, when we login to the re0 (backup) and check the isis, rsvp, etc… we
>> see the following:
>> 
>> XXX> request routing-engine login other-routing-engine
>> 
>> 
>> --- JUNOS 10.2R3.10 built 2010-10-16 19:24:06 UTC
>> 
>> {backup}
>> 
>> XXX> show isis adjacency
>> 
>> 
>> 
>> {backup}
>> 
>> XXX> show rsvp session
>> 
>> Ingress RSVP: 0 sessions
>> 
>> Total 0 displayed, Up 0, Down 0
>> 
>> 
>> 
>> Egress RSVP: 0 sessions
>> 
>> Total 0 displayed, Up 0, Down 0
>> 
>> 
>> 
>> Transit RSVP: 0 sessions
>> 
>> Total 0 displayed, Up 0, Down 0
>> 
>> 
>> 
>> {backup}
>> 
>> XXX>
>> 
>> While we can see the BGP routes and L3VPN routes...
>> 
>> We have tried to replace the backup with another one, but with the same
>> results
>> 
>> Any ideas? This issue is really confusing us, and it is a very critical
>> router in our network.
>> 
>> 
>> 
>> Thank you in advance
>> 
>> Mohammad Salbad
>> 
>> ___
>> juniper-nsp mailing list juniper-nsp@puck.nether.net
>> https://puck.nether.net/mailman/listinfo/juniper-nsp
> ___
> juniper-nsp mailing list juniper-nsp@puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp

___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp

Re: [j-nsp] flexible ethernet services / pppoe

2012-02-15 Thread Paul Stewart
Thanks... I may have actually found a "better" way to do this.  Is there any
reason this wouldn't work?

paul@dis1.beachburg1# show
description "Wireless Network Trunk";
vlan-tagging;
encapsulation flexible-ethernet-services;
unit 400 {
    description Wireless_Public_DHCP;
    encapsulation vlan-bridge;
    vlan-id 400;
    family bridge;
}
unit 401 {
    description Wireless_Private_Management;
    encapsulation vlan-bridge;
    vlan-id 401;
    family bridge;
}
unit 402 {
    description Wireless_PPPOE;
    vlan-id 402;
    family pppoe {
        dynamic-profile PPPOE;
    }
}
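
Note that unit 402 assumes a dynamic profile named PPPOE is already defined.
Purely as a sketch of what such a profile might look like (the chap and
unnumbered-address choices are placeholders, not our actual profile):

dynamic-profiles {
    PPPOE {
        interfaces {
            pp0 {
                unit "$junos-interface-unit" {
                    ppp-options {
                        chap;
                    }
                    pppoe-options {
                        underlying-interface "$junos-underlying-interface";
                        server;
                    }
                    family inet {
                        unnumbered-address lo0.0;
                    }
                }
            }
        }
    }
}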

Thanks again,
Paul


-Original Message-
From: Per Granath [mailto:per.gran...@gcc.com.cy] 
Sent: February-15-12 1:54 AM
To: Paul Stewart; juniper-nsp@puck.nether.net
Subject: RE: [j-nsp] flexible ethernet services / pppoe

> I'm trying to work with an interface that has mixed subinterfaces. 
> Some of the subinterfaces are part of a bridge domain, some are family 
> inet, and one is PPPoE for subscriber termination.
> 
> 
> unit 402 {
>     description Wireless_PPPOE;
>     encapsulation ppp-over-ether;
>     vlan-id 402;
>     pppoe-underlying-options {
>         duplicate-protection;
>         dynamic-profile PPPOE;
>     }
> }
> 
> paul@dis1.beachburg1# commit check
> 
> [edit interfaces ge-1/2/8]
>   'unit 402'
>  Link encapsulation type is not valid for device type
> error: configuration check-out failed

Try this:

[edit interfaces]
demux0 {
    unit 402 {
        proxy-arp;
        vlan-id 402;
        demux-options {
            underlying-interface ge-1/2/8;
        }
        family pppoe {
            duplicate-protection;
            dynamic-profile PPPOE;
        }
    }
}

(and remove the other pppoe unit from the physical interface)

___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] MX960 Redundant RE problem

2012-02-15 Thread Serge Vautour
You can also run the following command on the backup RE to check its state:

me@BLAH-re1> show system switchover 
Graceful switchover: On
Configuration database: Ready
Kernel database: Ready
Peer state: Steady State


If this command and "show task replication" on the master RE don't show the 
correct outputs, I agree with the recommendation to turn GRES/NSR on/off. If 
that doesn't work, reboot the REs.
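
For the reboot step, the backup can usually be bounced on its own from the
master, something along these lines (hostname is a placeholder):

me@BLAH-re1> request system reboot other-routing-engine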


Serge




 From: Mohammad 
To: juniper-nsp@puck.nether.net 
Sent: Wednesday, February 15, 2012 6:44:42 AM
Subject: Re: [j-nsp] MX960 Redundant RE problem
 
Kindly find the following output, I hope it is helpful
x> show task replication 
        Stateful Replication: Enabled
        RE mode: Master

    Protocol                Synchronization Status
    OSPF                    Complete
    BGP                     Complete
    IS-IS                   Complete
    MPLS                    Complete
    RSVP                    Complete

{master}
>


___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] Random BGP peer drops

2012-02-15 Thread Serge Vautour
Our NMS gets CPU & memory usage on both REs on the RR every 5 min. The graphs 
don't show anything abnormal. CPU usage is <5% on both REs and memory is <25%.
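
For a quick spot check outside the NMS, the per-RE numbers are also visible
from the CLI, e.g. (hostname is a placeholder):

me@OUR-RR1> show chassis routing-engine
me@OUR-RR1> show system processes extensive | match rpd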

Serge




 From: David Ball 
To: Serge Vautour  
Cc: "juniper-nsp@puck.nether.net"  
Sent: Tuesday, February 14, 2012 4:47:41 PM
Subject: Re: [j-nsp] Random BGP peer drops
 
  I saw something similar on a T-series w/2 REs running 10.0, and it
was related to an NSR bug that was causing the backup RE to thrash and
push CPU through the roof on the primary.  I also recall a mib2d bug
resulting in high CPU, though I'm sure you would have noticed it in
either case.

David


On 14 February 2012 15:31, Serge Vautour  wrote:
> Yes. That was the first thing we checked. I should've mentioned that.
>
>
> Serge
>
>
>
> 
>  From: "sth...@nethelp.no" 
> To: se...@nbnet.nb.ca; sergevaut...@yahoo.ca
> Cc: juniper-nsp@puck.nether.net
> Sent: Tuesday, February 14, 2012 3:41:02 PM
> Subject: Re: [j-nsp] Random BGP peer drops
>
>> It's been rare but we've seen random iBGP peer drops. The first was
>> several months ago. We've now seen 2 in the last week.
>
> Have you verified that you have a consistent MTU throughout your net?
>
> Steinar Haug, Nethelp consulting, sth...@nethelp.no
> ___
> juniper-nsp mailing list juniper-nsp@puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp

___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] Random BGP peer drops

2012-02-15 Thread Serge Vautour
We do. It's standard on all our interfaces:

myuser@MYPE1-re0> show configuration protocols ospf area 0 interface xe-0/0/0 
interface-type p2p;
metric 100;
ldp-synchronization;


Serge




 From: Addy Mathur 
To: Serge Vautour  
Cc: "juniper-nsp@puck.nether.net"  
Sent: Wednesday, February 15, 2012 10:54:29 AM
Subject: Re: [j-nsp] Random BGP peer drops
 

Serge:

Do you have ldp synchronization enabled?

http://www.juniper.net/techpubs/en_US/junos10.4/topics/usage-guidelines/routing-configuring-synchronization-between-ldp-and-igps.html

--Addy.

On Tuesday, February 14, 2012, Serge Vautour  wrote:
> Hello,
>
> We have an MPLS network made up of many MX960s and MX80s. We run OSPF as our 
> IGP - all links in area 0. BGP is used for signaling of all L2VPN & VPLS. At 
> this time we only have 1 L3VPN for mgmt. LDP is used for transport LSPs. 
> We have M10i as dedicated Route Reflectors. Most MX are on 10.4S5. M10i still 
> on 10.0R3. Each PE peers with 2 RRs and has 2 diverse uplinks for redundancy. 
> If 1 link fails, there's always another path.
>
> It's been rare but we've seen random iBGP peer drops. The first was several 
> months ago. We've now seen 2 in the last week. 2 of the 3 were related to 
> link failures. The primary path from the PE to the RR failed. BGP timed out 
> after a bit. Here's an example:
>
> Feb  8 14:05:32  OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: 
> ifIndex 129, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-7/0/0
> Feb  8 14:05:32  OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: 
> ifIndex 120, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-0/0/0
> Feb  8 14:06:33  OURBOX-re0 rpd[1413]: %DAEMON-4: bgp_hold_timeout:3660: 
> NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired 
> Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket 
> buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 1056225956 snd_nxt: 
> 1056225956 snd_wnd: 16384 rcv_nxt: 3883304584 rcv_adv: 3883320968, hold timer 0
>
> BGP holdtime is 90sec. This is more than enough time for OSPF to find the 
> other path and converge. The BGP peer came back up before the link so things 
> did eventually converge.
>
> The last BGP peer drop happened without any link failure. Out of the blue, 
> BGP just went down. The logs on the PE:
>
> Feb 13 20:40:48  OUR-PE1 rpd[1159]: %DAEMON-4: bgp_hold_timeout:3660: 
> NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired 
> Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket 
> buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 2149021074 snd_nxt: 
> 2149021074 snd_wnd: 16384 rcv_nxt: 2049196833 rcv_adv: 2049213217, hold timer 0
> Feb 13 20:40:48  OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: 
> BGP peer 10.1.1.2 (Internal AS 123) changed state from Established to Idle 
> (event HoldTime)
> Feb 13 20:41:21  OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: 
> BGP peer 10.1.1.2 (Internal AS 123) changed state from OpenConfirm to 
> Established (event RecvKeepAlive)
>
> The RR side shows the same:
>
> Feb 13 20:40:49  OUR-RR1-re0 rpd[1187]: 
> %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 
> 123) changed state from Established to Idle (event RecvNotify)
> Feb 13 20:40:49  OUR-RR1-re0 rpd[1187]: %DAEMON-4: bgp_read_v4_message:8927: 
> NOTIFICATION received from 10.1.1.61 (Internal AS 123): code 4 (Hold Timer 
> Expired Error), socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 
> 2049196833 snd_nxt: 2049196871 snd_wnd: 16384 rcv_nxt: 2149021095 rcv_adv: 
> 2149037458, hold timer 1:03.112744
> Feb 13 20:41:21  OUR-RR1-re0 rpd[1187]: 
> %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 
> 123) changed state from EstabSync to Established (event RsyncAck)
> Feb 13 20:41:30  OUR-RR1-re0 rpd[1187]: %DAEMON-3: bgp_send: sending 30 bytes 
> to 10.1.1.61 (Internal AS 123) blocked (no spooling requested): Resource 
> temporarily unavailable
>
>
> You can see the peer wasn't down long and re-established on its own. The 
> logs on the RR make it look like it received a msg from the PE that it was 
> dropping the BGP session. The last error on the RR seems odd as well.
>
>
> Has anyone seen something like this before? We do have a case open regarding 
> a large number of LSA retransmits. TAC is saying this is a bug related to NSR 
> but shouldn't cause any negative impacts. I'm not sure if this is related. 
> I'm considering opening a case for this as well but I'm not very confident 
> I'll get far.
>
>
> Any help would be appreciated.
>
>
> Thanks,
> Serge
> ___
> juniper-nsp mailing list juniper-nsp@puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp
> 
___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] Random BGP peer drops

2012-02-15 Thread Addy Mathur
Serge:

Do you have ldp synchronization enabled?

http://www.juniper.net/techpubs/en_US/junos10.4/topics/usage-guidelines/routing-configuring-synchronization-between-ldp-and-igps.html

--Addy.

On Tuesday, February 14, 2012, Serge Vautour  wrote:
> Hello,
>
> We have an MPLS network made up of many MX960s and MX80s. We run OSPF as
our IGP - all links in area 0. BGP is used for signaling of all L2VPN &
VPLS. At this time we only have 1 L3VPN for mgmt. LDP is used for
transport LSPs. We have M10i as dedicated Route Reflectors. Most MX are on
10.4S5. M10i still on 10.0R3. Each PE peers with 2 RRs and has 2 diverse
uplinks for redundancy. If 1 link fails, there's always another path.
>
> It's been rare but we've seen random iBGP peer drops. The first was
several months ago. We've now seen 2 in the last week. 2 of the 3 were
related to link failures. The primary path from the PE to the RR failed.
BGP timed out after a bit. Here's an example:
>
> Feb  8 14:05:32  OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN:
ifIndex 129, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-7/0/0
> Feb  8 14:05:32  OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN:
ifIndex 120, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-0/0/0
> Feb  8 14:06:33  OURBOX-re0 rpd[1413]: %DAEMON-4: bgp_hold_timeout:3660:
NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired
Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket
buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 1056225956 snd_nxt:
1056225956 snd_wnd: 16384 rcv_nxt: 3883304584 rcv_adv: 3883320968, hold
timer 0
>
> BGP holdtime is 90sec. This is more than enough time for OSPF to find the
other path and converge. The BGP peer came back up before the link so
things did eventually converge.
>
> The last BGP peer drop happened without any link failure. Out of the
blue, BGP just went down. The logs on the PE:
>
> Feb 13 20:40:48  OUR-PE1 rpd[1159]: %DAEMON-4: bgp_hold_timeout:3660:
NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired
Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket
buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 2149021074 snd_nxt:
2149021074 snd_wnd: 16384 rcv_nxt: 2049196833 rcv_adv: 2049213217, hold
timer 0
> Feb 13 20:40:48  OUR-PE1 rpd[1159]:
%DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.2 (Internal AS
123) changed state from Established to Idle (event HoldTime)
> Feb 13 20:41:21  OUR-PE1 rpd[1159]:
%DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.2 (Internal AS
123) changed state from OpenConfirm to Established (event RecvKeepAlive)
>
> The RR side shows the same:
>
> Feb 13 20:40:49  OUR-RR1-re0 rpd[1187]:
%DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS
123) changed state from Established to Idle (event RecvNotify)
> Feb 13 20:40:49  OUR-RR1-re0 rpd[1187]: %DAEMON-4:
bgp_read_v4_message:8927: NOTIFICATION received from 10.1.1.61 (Internal AS
123): code 4 (Hold Timer Expired Error), socket buffer sndcc: 57 rcvcc: 0
TCP state: 4, snd_una: 2049196833 snd_nxt: 2049196871 snd_wnd: 16384
rcv_nxt: 2149021095 rcv_adv: 2149037458, hold timer 1:03.112744
> Feb 13 20:41:21  OUR-RR1-re0 rpd[1187]:
%DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS
123) changed state from EstabSync to Established (event RsyncAck)
> Feb 13 20:41:30  OUR-RR1-re0 rpd[1187]: %DAEMON-3: bgp_send: sending 30
bytes to 10.1.1.61 (Internal AS 123) blocked (no spooling requested):
Resource temporarily unavailable
>
>
> You can see the peer wasn't down long and re-established on its own. The
logs on the RR make it look like it received a msg from the PE that it was
dropping the BGP session. The last error on the RR seems odd as well.
>
>
> Has anyone seen something like this before? We do have a case open
regarding a large number of LSA retransmits. TAC is saying this is a bug
related to NSR but shouldn't cause any negative impacts. I'm not sure if
this is related. I'm considering opening a case for this as well but I'm
not very confident I'll get far.
>
>
> Any help would be appreciated.
>
>
> Thanks,
> Serge
> ___
> juniper-nsp mailing list juniper-nsp@puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp
>
___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] WAN-PHY support for EX-series 10g interfaces

2012-02-15 Thread Tim Jackson
LAN-PHY only on EX4200/4500, as far as I know.

--
Tim

On Tue, Feb 14, 2012 at 11:53 PM, Dale Shaw  wrote:
> Hi,
>
> Potentially odd question here but does anyone know, from 1st hand
> experience, whether WAN-PHY mode is supported on 10g interfaces in
> EX-series devices? Specifically EX4200 and/or EX4500?
>
> I ask because we have a new carrier circuit being delivered in the
> not-too-distant future and we need to plug something into it to test
> it. Eventually we'll jam a SRX5800 with a 4x10GE DPC onto the end of
> it but in the meantime it would be handy to terminate and test with
> something .. smaller.
>
> The existing interfaces we have (provisioned before my time)
> apparently needed to be configured with the "framing wan-phy" and
> "optics-options wavelength 1550.12" configuration options. The framing
> command auto-completes on an EX-series box but the optics-options
> command is hidden.
>
> Couldn't find any definitive references in the product docs.
>
> cheers,
> Dale
> ___
> juniper-nsp mailing list juniper-nsp@puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp
___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] SCB-E

2012-02-15 Thread Jonas Frey (Probe Networks)
PR718485:
Workaround:
Disable the "then log" or "then syslog" in firewall configuration.


On Wednesday, 15.02.2012, at 12:28 +0100, Per Randrup Nielsen wrote:
> PR718485


signature.asc
Description: This is a digitally signed message part
___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp

Re: [j-nsp] SCB-E

2012-02-15 Thread Per Randrup Nielsen
What workaround is available?

/Per

-Original Message-
From: juniper-nsp-boun...@puck.nether.net 
[mailto:juniper-nsp-boun...@puck.nether.net] On Behalf Of Frank Blankman
Sent: 13 February 2012 18:02
To: David Temkin
Cc: juniper-nsp@puck.nether.net
Subject: Re: [j-nsp] SCB-E

We've seen PR718485 get hit on a 960 with SCB-E, some 16x10G MPCs, and 11.4R1, 
though no recurring hits after applying the workaround. Running a config close 
to Dave's.

Frank


On Feb 13, 2012, at 3:26 PM, David Temkin wrote:
> Not horrible, but similar results in a box with dual SCB-E and 2 16x10G MPCs:
> 
> $ time snmpbulkwalk -v2c -c # x.x.x.x ifHCInOctets > /dev/null
> 
> real    0m6.262s
> user    0m0.028s
> sys     0m0.018s
> 
> So, it's usable - and I haven't hit any other showstopper bugs thus far - but 
> I'm running purely IP (two full tables + other associated peers).
> 
> -Dave
> 
> On 2/8/12 6:10 AM, david@orange.com wrote:
>> Hi,
>> 
>> Same results on my side...
>> 
>> Just to clarify, to be sure there is no misunderstanding: it's not 
>> triggered by the SCB-E, it's a software issue... But currently we only have 
>> this release to play with the SCB-E :-)
>> 
>> 
>> 
>> Regards
>> David
>> 
>> David Roy
>> IP/MPLS Support engineer - Orange France
>> Ph. +33 2 99 87 64 72 - Mob. +33 6 85 52 22 13
>> david@orange.com
>> 
>> JNCIE-M&T/SP #703
>> JNCIP-ENT
>> 
>> -Original Message-
>> From: juniper-nsp-boun...@puck.nether.net 
>> [mailto:juniper-nsp-boun...@puck.nether.net] On Behalf Of Daniel Roesen
>> Sent: Wednesday, 8 February 2012 07:44
>> To: juniper-nsp@puck.nether.net
>> Subject: Re: [j-nsp] SCB-E
>> 
>> On Wed, Feb 08, 2012 at 01:23:11AM +, OBrien, Will wrote:
>>> Anyone running the SCB-E? I've got a stack of them with a set of fresh
>>> MX480s ready to roll out. I'm curious what code your running.
>> Given that there is only one public JUNOS release which supports SCB-E, 
>> there aren't many options: 11.4R1 - and that one has unusable SNMP due to 
>> new PFE statistics request delays introduced (feature, not bug of
>> course!):
>> 
>> foo@lab-MX960>  show snmp mib walk ifHCOutOctets | count
>> Count: 413 lines
>> 
>> $ time snmpbulkwalk -v2c -c removed x.x.x.x ifHCInOctets>  /dev/null
>> Timeout: No Response from x.x.x.x
>> 
>> real    0m26.380s
>> user    0m1.647s
>> sys     0m0.133s
>> 
>> PR/731833 - fix supposed to come in 11.4R3 slated for May.
>> 
>> So as far as things stand, the SCB-E is not deployable before mid-2012 at 
>> the earliest, and only if (and that's a big "if" given the 10.4 experience) 
>> 11.4R3 turns out to be usable.
>> 
>> Ah, and 11.4R1 floods your log with messages like:
>> 
>> mcsn[91713]: %DAEMON-6: krt_decode_nexthop: Try freeing: nh-handle: 0x0
>> nh-index: 1049083 fwdtype: 3
>> 
>> No idea whether that's service-affecting - we haven't observed any impact due 
>> to that yet.
>> 
>> Best regards,
>> Daniel
>> 
>> --
>> CLUE-RIPE -- Jabber: d...@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0 
>> ___
>> juniper-nsp mailing list juniper-nsp@puck.nether.net 
>> https://puck.nether.net/mailman/listinfo/juniper-nsp

___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp

___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] MX960 Redundant RE problem

2012-02-15 Thread Mohammad
Kindly find the following output, I hope it is helpful
x> show task replication 
Stateful Replication: Enabled
RE mode: Master

ProtocolSynchronization Status
OSPFComplete  
BGP Complete  
IS-IS   Complete  
MPLSComplete  
RSVPComplete  

{master}
>


___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] MX960 Redundant RE problem

2012-02-15 Thread Diogo Montagner
You have GRES enabled and the backup RE was not ready to take over. See
the error message in the first lines of your output.

Thanks

On 2/15/12, Mohammad  wrote:
> Hi everyone
>
>
>
> We have an MX960 with two routing engines, Re0: Backup, Re1: Master
>
> When we try to switchover to the backup RE we see the following message:
>
> XXX# run request chassis routing-engine master switch
>
> error: Standby Routing Engine is not ready for graceful switchover
> (replication_err soft_mask_err)
>
> Toggle mastership between routing engines ? [yes,no] (no)
>
> Noting that we used to switch over between the two REs a day before with no
> issues
>
>
>
> Also, when we login to the re0 (backup) and check the isis, rsvp, etc… we
> see the following:
>
> XXX> request routing-engine login other-routing-engine
>
>
> --- JUNOS 10.2R3.10 built 2010-10-16 19:24:06 UTC
>
> {backup}
>
> XXX> show isis adjacency
>
>
>
> {backup}
>
> XXX> show rsvp session
>
> Ingress RSVP: 0 sessions
>
> Total 0 displayed, Up 0, Down 0
>
>
>
> Egress RSVP: 0 sessions
>
> Total 0 displayed, Up 0, Down 0
>
>
>
> Transit RSVP: 0 sessions
>
> Total 0 displayed, Up 0, Down 0
>
>
>
> {backup}
>
> XXX>
>
> While we can see the BGP routes and L3VPN routes...
>
> We have tried to replace the backup with another one, but with the same
> results
>
> Any ideas? This issue is really confusing us, and it is a very critical
> router in our network.
>
>
>
> Thank you in advance
>
> Mohammad Salbad
>
> ___
> juniper-nsp mailing list juniper-nsp@puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp

-- 
Sent from my mobile device

./diogo -montagner

___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] MX960 Redundant RE problem

2012-02-15 Thread Per Granath
> We have an MX960 with two routing engines, Re0: Backup, Re1: Master
> 
> When we try to switchover to the backup RE we see the following message:
> 
> XXX# run request chassis routing-engine master switch
> 
> error: Standby Routing Engine is not ready for graceful switchover
> (replication_err soft_mask_err)
> 

Disable graceful-switchover (and nonstop-routing) and then commit (assuming 
there is commit synchronize).
Then enable it again, commit, and wait for the REs to sync.
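
Roughly this sequence, in case it helps (prompt and hostname are placeholders;
both statements go in the same commit, since nonstop-routing depends on
graceful-switchover):

{master}[edit]
user@XXX# deactivate routing-options nonstop-routing
user@XXX# deactivate chassis redundancy graceful-switchover
user@XXX# commit synchronize

...then, once that has gone through, re-enable:

user@XXX# activate chassis redundancy graceful-switchover
user@XXX# activate routing-options nonstop-routing
user@XXX# commit synchronize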

Something with the kernel database not being healthy, possibly.

...or try JTAC :)

___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp