[j-nsp] RPD Crash on M320
Hi everybody,

Recently we had a continuous link flap between our M320 and remote sites. We have a lot of L2circuits between these sites on this M320. At one point the rpd process crashed, producing the log below. I should mention that the link flap started at 12:10 AM and continued until 2:30 AM, but the crash occurred at 12:30 AM.

Jan 3 00:31:04 apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session 10.237.253.168 is down, reason: received notification from peer
Jan 3 00:31:05 apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session 10.237.254.1 is down, reason: received notification from peer
Jan 3 00:31:05 apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session 10.237.253.120 is down, reason: received notification from peer
Jan 3 00:31:05 apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received PRL ACK message on non-active socket w/handle 0x1008af801c6
Jan 3 00:31:06 apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session 10.237.253.192 is down, reason: received notification from peer
Jan 3 00:31:28 apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received PRL ACK message on non-active socket w/handle 0x10046fa004e
Jan 3 00:32:18 apa-rtr-028 init: routing (PID 42128) terminated by signal number 6. Core dumped!
Jan 3 00:32:18 apa-rtr-028 init: routing (PID 18307) started
Jan 3 00:32:18 apa-rtr-028 rpd[18307]: L2CKT acquiring mastership for primary
Jan 3 00:32:18 apa-rtr-028 rpd[18307]: L2VPN acquiring mastership for primary
Jan 3 00:32:20 apa-rtr-028 rpd[18307]: RPD_KRT_KERNEL_BAD_ROUTE: KRT: lost ifl 0 for route (null)
Jan 3 00:32:20 apa-rtr-028 last message repeated 65 times
Jan 3 00:32:20 apa-rtr-028 rpd[18307]: L2CKT acquiring mastership for primary
Jan 3 00:32:20 apa-rtr-028 rpd[18307]: Primary starts deleting all L2circuit IFL Repository
Jan 3 00:32:20 apa-rtr-028 rpd[18307]: RPD_TASK_BEGIN: Commencing routing updates, version 11.2R2.4, built 2011-09-01 06:53:31 UTC by builder
Jan 3 00:32:21 apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex 1329, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1041
Jan 3 00:32:21 apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex 1311, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1039
Jan 3 00:32:21 apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex 1312, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1038

The thing is, we always see this kind of log (apart from the crash) on this device. Is there any clue as to why the rpd process crashed? I don't have access to JTAC, so I cannot have the dump analyzed.
The Junos version is 11.2R2.4.

Thank you for your help and support
___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
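[Editor's note: a signal-6 termination with "Core dumped!" should leave an rpd core file on the Routing Engine, and even without JTAC you can confirm it exists and copy it off-box for later analysis. A minimal sketch of the usual operational-mode commands; the exact core filename and the destination URL are illustrative, and the directory varies by release (commonly /var/crash/ or /var/tmp/):

```
show system core-dumps
file list detail /var/crash/
file copy /var/crash/rpd.core.0.gz scp://admin@fileserver/cores/
```

Keeping the core and the matching /var/log/messages extract together makes a later JTAC or vendor-partner analysis much easier.]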
Re: [j-nsp] RPD Crash on M320
Just asking. Anyway, any idea about my comments? Also, is there any mechanism or approach for dealing with this kind of situation?

On Jan 4, 2016 6:45 PM, "Niall Donaghy" <niall.dona...@geant.org> wrote:
> Reading the core dump is beyond my expertise I'm afraid.
>
> Br,
> Niall
> [snip]
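[Editor's note: on the question of a general mechanism for these situations, one common Junos approach (not raised in the thread, offered as a suggestion) is interface hold-time damping, which makes the kernel delay reporting link transitions so that rpd and LDP are not hammered by every flap. Interface name and millisecond values are illustrative; note that a down hold-time also delays legitimate failure detection:

```
set interfaces ge-0/0/0 hold-time up 2000 down 2000
```

With this in place, a link that bounces repeatedly within the hold-time window generates far fewer up/down events toward the routing process.]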
Re: [j-nsp] RPD Crash on M320
Reading the core dump is beyond my expertise I'm afraid.

Br,
Niall

From: Alireza Soltanian [mailto:soltan...@gmail.com]
Sent: 04 January 2016 15:14
To: Niall Donaghy
Cc: juniper-nsp@puck.nether.net
Subject: RE: [j-nsp] RPD Crash on M320

Hi,
Yes, I checked the CPU graph and there was a spike on the CPU load.
The link was flappy for 20 minutes before the crash, and it remained flappy for two hours afterwards. During this time we could see LDP sessions go up and down over and over, but the only time there was a crash was this once, and there was no spike on CPU then.
I must mention we had another issue with another M320: whenever a link flapped, the rpd CPU went high and all OSPF sessions reset. I found the root cause for that one; it was a traceoptions configuration for LDP. On this box we don't use traceoptions.
Is there any way to read the dump?

Thank you
[snip]
Re: [j-nsp] RPD Crash on M320
Hi,
Yes, I checked the CPU graph and there was a spike on the CPU load.
The link was flappy for 20 minutes before the crash, and it remained flappy for two hours afterwards. During this time we could see LDP sessions go up and down over and over, but the only time there was a crash was this once, and there was no spike on CPU then.
I must mention we had another issue with another M320: whenever a link flapped, the rpd CPU went high and all OSPF sessions reset. I found the root cause for that one; it was a traceoptions configuration for LDP. On this box we don't use traceoptions.
Is there any way to read the dump?

Thank you

On Jan 4, 2016 6:34 PM, "Niall Donaghy" <niall.dona...@geant.org> wrote:
> Hi Alireza,
>
> It seemed to me this event could be related to the core dump:
> Jan 3 00:31:28 apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received PRL ACK message on non-active socket w/handle 0x10046fa004e
> However, upon further investigation (http://kb.juniper.net/InfoCenter/index?page=content&id=KB18195) I see these messages are normal/harmless.
>
> Do you have Cacti graphs of CPU utilisation for both REs, before the rpd crash? Link flapping may be giving rise to CPU hogging, leading to instability and a subsequent rpd crash.
> Was the link particularly flappy just before the crash?
>
> Kind regards,
> Niall
> [snip]
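[Editor's note: since LDP traceoptions were the culprit on the other M320, it is cheap to verify they are genuinely absent, or to switch them off, on this box too. A sketch; the first command is operational mode, the rest configuration mode, and `deactivate` only applies if the stanza actually exists:

```
show configuration protocols ldp | match traceoptions
configure
deactivate protocols ldp traceoptions
commit
```

`deactivate` keeps the stanza in the config for later reuse, whereas `delete protocols ldp traceoptions` removes it entirely.]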
Re: [j-nsp] RPD Crash on M320
Hi Alireza,

It seemed to me this event could be related to the core dump:
Jan 3 00:31:28 apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received PRL ACK message on non-active socket w/handle 0x10046fa004e
However, upon further investigation (http://kb.juniper.net/InfoCenter/index?page=content&id=KB18195) I see these messages are normal/harmless.

Do you have Cacti graphs of CPU utilisation for both REs, before the rpd crash? Link flapping may be giving rise to CPU hogging, leading to instability and a subsequent rpd crash.
Was the link particularly flappy just before the crash?

Kind regards,
Niall

> -----Original Message-----
> From: juniper-nsp [mailto:juniper-nsp-boun...@puck.nether.net] On Behalf Of Alireza Soltanian
> Sent: 04 January 2016 11:04
> To: juniper-nsp@puck.nether.net
> Subject: [j-nsp] RPD Crash on M320
>
> Recently, we had a continuous link flap between our M320 and remote sites. We have a lot of L2circuits between these sites on our M320. At one point the rpd process crashed, producing the following log.
> [snip]
Re: [j-nsp] RPD Crash on M320
From your comments I understand there was no CPU spike, and traceoptions aren't the cause either.

By this point* I would have raised a JTAC case for analysis of the core dump, and taken their lead.

* assuming you've checked all sources of information and found no clues as to the cause, i.e. logfile analysis, resource-exhaustion checks, and analysis of the config, e.g. are you using suspected buggy features, or anything non-standard/complex/advanced?

We are running 14.1R5.5 on the MX series, have lots of features turned on, and have several workarounds in place. We have found a few bugs for JNPR...

Kind regards,
Niall

From: Alireza Soltanian [mailto:soltan...@gmail.com]
Sent: 04 January 2016 15:18
To: Niall Donaghy
Cc: juniper-nsp@puck.nether.net
Subject: RE: [j-nsp] RPD Crash on M320

Just asking. Anyway, any idea about my comments? Also, is there any mechanism or approach for dealing with this kind of situation?
[snip]
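[Editor's note: the resource-exhaustion checks mentioned above map onto a handful of standard operational commands; a sketch of the ones typically used when rpd stability is in question:

```
show system processes extensive | match rpd
show task memory
show chassis routing-engine
show system core-dumps
```

Watching `show task memory` over time is particularly useful for spotting a slow rpd memory leak that a single CPU graph would not reveal.]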
Re: [j-nsp] RPD Crash on M320
Hi,

11.2 is end of support, so my guess is that there's no point in raising a case. As a first step I'd try upgrading to a supported release and then check whether that helps.

Regards,
Wojciech

On Jan 4, 2016 17:02, "Niall Donaghy" <niall.dona...@geant.org> wrote:
> From your comments I understand there was no CPU spike, and traceoptions aren't the cause either.
> By this point* I would have raised a JTAC case for analysis of the core dump, and taken their lead.
> [snip]
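[Editor's note: a sketch of the suggested upgrade, assuming the target image has already been copied to the router; the package filename is purely illustrative, not a recommendation of a specific release, and on a dual-RE M320 you would normally upgrade the backup RE first:

```
request system software add /var/tmp/jinstall-11.4R13-domestic-signed.tgz
request system reboot
```

Checking the target release's resolved-PR list for rpd/LDP/L2circuit fixes before choosing it is worthwhile, since this crash sits squarely in that code path.]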