[j-nsp] RPD Crash on M320

2016-01-04 Thread Alireza Soltanian
Hi everybody

Recently, we had continuous link flaps between our M320 and remote sites. We
have a lot of L2Circuits to these sites on our M320. At one point the RPD
process crashed, which led to the log below. I should mention that the link
flapping started at 12:10 AM and continued until 2:30 AM, but the crash
occurred at 12:30 AM.

 

Jan  3 00:31:04  apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
10.237.253.168 is down, reason: received notification from peer

Jan  3 00:31:05  apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
10.237.254.1 is down, reason: received notification from peer

Jan  3 00:31:05  apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
10.237.253.120 is down, reason: received notification from peer

Jan  3 00:31:05  apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received PRL
ACK message on non-active socket w/handle 0x1008af801c6

Jan  3 00:31:06  apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
10.237.253.192 is down, reason: received notification from peer

Jan  3 00:31:28  apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received PRL
ACK message on non-active socket w/handle 0x10046fa004e

 

Jan  3 00:32:18  apa-rtr-028 init: routing (PID 42128) terminated by signal
number 6. Core dumped!

Jan  3 00:32:18  apa-rtr-028 init: routing (PID 18307) started

Jan  3 00:32:18  apa-rtr-028 rpd[18307]: L2CKT acquiring mastership for
primary

Jan  3 00:32:18  apa-rtr-028 rpd[18307]: L2VPN acquiring mastership for
primary

Jan  3 00:32:20  apa-rtr-028 rpd[18307]: RPD_KRT_KERNEL_BAD_ROUTE: KRT: lost
ifl 0 for route (null)

Jan  3 00:32:20  apa-rtr-028 last message repeated 65 times

Jan  3 00:32:20  apa-rtr-028 rpd[18307]: L2CKT acquiring mastership for
primary

Jan  3 00:32:20  apa-rtr-028 rpd[18307]: Primary starts deleting all
L2circuit IFL Repository

Jan  3 00:32:20  apa-rtr-028 rpd[18307]: RPD_TASK_BEGIN: Commencing routing
updates, version 11.2R2.4, built 2011-09-01 06:53:31 UTC by builder

 

Jan  3 00:32:21  apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex
1329, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1041

Jan  3 00:32:21  apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex
1311, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1039

Jan  3 00:32:21  apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex
1312, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1038

 

The thing is, we always see this kind of log (apart from the crash) on the
device. Is there any clue as to why the RPD process crashed? I don't have
access to JTAC, so I cannot analyze the dump.

The Junos version is 11.2R2.4.
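For what it's worth, the core can at least be located and copied off the box
without JTAC; a minimal sketch (file names and the copy destination are
illustrative, not taken from this router):

    show system core-dumps
    file list /var/crash detail
    file copy /var/crash/rpd.core.0.gz ftp://user@host/incoming/

Off-box, gdb can at least confirm how the process died even without Juniper's
symbol files, though a meaningful backtrace would need the matching rpd image:

    % gunzip rpd.core.0.gz
    % gdb -c rpd.core.0
    (gdb) info target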

 

Thank you for your help and support

___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] RPD Crash on M320

2016-01-04 Thread Alireza Soltanian
Just asking. Anyway, any ideas about my comments? Also, is there any mechanism
or approach for dealing with these kinds of situations?
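To illustrate the kind of mechanism I have in mind, a rough sketch; the
interface name and the millisecond timers are placeholders:

    set interfaces ge-1/0/0 hold-time up 5000 down 2000

i.e. damping the flapping interface so rpd is not hit by every single
transition, or simply disabling it until the transport is fixed:

    set interfaces ge-1/0/0 disable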
On Jan 4, 2016 6:45 PM, "Niall Donaghy" <niall.dona...@geant.org> wrote:

> Reading the core dump is beyond my expertise I’m afraid.
>
> Br,
>
> Niall
>
> [snip]

Re: [j-nsp] RPD Crash on M320

2016-01-04 Thread Niall Donaghy
Reading the core dump is beyond my expertise I’m afraid.

 

Br,

Niall

 

From: Alireza Soltanian [mailto:soltan...@gmail.com] 
Sent: 04 January 2016 15:14
To: Niall Donaghy
Cc: juniper-nsp@puck.nether.net
Subject: RE: [j-nsp] RPD Crash on M320

 

[snip]

Re: [j-nsp] RPD Crash on M320

2016-01-04 Thread Alireza Soltanian
Hi
Yes, I checked the CPU graph. The link was flapping for 20 minutes before the
crash, and it remained flappy for two hours afterwards. During this time we
can see LDP sessions going down and up over and over, but the crash happened
only this once, and there was no spike in CPU at that point.
I must mention we had another issue with another M320: whenever a link
flapped, rpd CPU went high and all OSPF sessions reset. I found the root
cause for that one; it was traceoptions for LDP. On this box we don't use
traceoptions.
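For reference, the sort of LDP traceoptions config that caused it looked
roughly like this (file name, size, and flags here are illustrative, not the
actual config), and deactivating it was the fix:

    set protocols ldp traceoptions file ldp-trace size 10m files 5
    set protocols ldp traceoptions flag all
    deactivate protocols ldp traceoptions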
Is there any way to read the dump?

Thank you
On Jan 4, 2016 6:34 PM, "Niall Donaghy" <niall.dona...@geant.org> wrote:

> Hi Alireza,
>
> [snip]
___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] RPD Crash on M320

2016-01-04 Thread Niall Donaghy
Hi Alireza,

It seemed to me this event could be related to the core dump: Jan  3
00:31:28  apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received PRL ACK
message on non-active socket w/handle 0x10046fa004e
However, upon further investigation
(http://kb.juniper.net/InfoCenter/index?page=content&id=KB18195) I see these
messages are normal/harmless.

Do you have Cacti graphs of CPU utilisation for both REs, before the rpd
crash? Link flapping may be giving rise to CPU hogging, leading to
instability and subsequent rpd crash.
Was the link particularly flappy just before the crash?
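A quick way to quantify that from the box itself, as a sketch (the interface
name is illustrative):

    show interfaces ae1 extensive | match "Last flapped"
    show log messages | match SNMP_TRAP_LINK_DOWN | count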

Kind regards,
Niall




> -----Original Message-----
> From: juniper-nsp [mailto:juniper-nsp-boun...@puck.nether.net] On Behalf Of
> Alireza Soltanian
> Sent: 04 January 2016 11:04
> To: juniper-nsp@puck.nether.net
> Subject: [j-nsp] RPD Crash on M320
>
> [snip]
___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] RPD Crash on M320

2016-01-04 Thread Niall Donaghy
 

From your comments I understand there was no CPU spike, and traceoptions aren’t 
the cause either.

By this point* I would have raised a JTAC case for analysis of the core dump, 
and taken their lead.

 

* assuming you’ve checked all sources of information and found no clues as to
the cause, i.e. logfile analysis, resource-exhaustion checks, and analysis of
the config (e.g. are you using suspected buggy features, or anything
non-standard/complex/advanced?).
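For the resource-exhaustion part, a sketch of the standard commands I'd start
with (availability of some varies by platform and release):

    show chassis routing-engine
    show system processes extensive | match rpd
    show task memory detail
    show system memory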

 

We are running 14.1R5.5 on MX series and have lots of features turned on, and 
several workarounds in place. We have found a few bugs for JNPR...

 

Kind regards,

Niall

 

From: Alireza Soltanian [mailto:soltan...@gmail.com] 
Sent: 04 January 2016 15:18
To: Niall Donaghy
Cc: juniper-nsp@puck.nether.net
Subject: RE: [j-nsp] RPD Crash on M320

 

[snip]

Re: [j-nsp] RPD Crash on M320

2016-01-04 Thread Wojciech Janiszewski
Hi,

11.2 is end-of-support, so my guess is that there's no point in raising a
case. As a first step I'd try upgrading to a supported release and then check
whether that helps.
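As a sketch of that first step (the release and package name are placeholders;
check the target release against the M320 support matrix first):

    show version
    request system software validate /var/tmp/jinstall-<release>-domestic-signed.tgz
    request system software add /var/tmp/jinstall-<release>-domestic-signed.tgz
    request system reboot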

Regards,
Wojciech
On 4 Jan 2016 at 17:02, "Niall Donaghy" <niall.dona...@geant.org> wrote:

> [snip]