Re: [j-nsp] MX480 MS-MPC-128G CHASSISD_SNMP_TRAP10 jnxFruOfflineReason 8 but no button press

2017-02-08 Thread Sebastian Becker
Hi Dave,

We had a similar issue on the PTX. It turned out the buttons were of poor
quality, so the normal vibration from the fan trays could register as a
button press. You will need to go to JTAC for further investigation.

-- 
Sebastian Becker
s...@lab.dtag.de

> On 08.02.2017 at 22:31, Michael Gehrmann wrote:
> 
> 
> Hi David,
> 
> Might be worth checking for core dumps. I'd also do a PR search and
> check the release notes for later releases. I have previously found that, on
> rare occasions, MS cards can get into weird corner cases which normally need
> JTAC to resolve.
> 
> Regards
> Mike

Re: [j-nsp] MX480 MS-MPC-128G CHASSISD_SNMP_TRAP10 jnxFruOfflineReason 8 but no button press

2017-02-08 Thread Michael Gehrmann
Hi David,

Might be worth checking for core dumps. I'd also do a PR search and
check the release notes for later releases. I have previously found that, on
rare occasions, MS cards can get into weird corner cases which normally need
JTAC to resolve.

Regards
Mike

-- 
Michael Gehrmann
Senior Network Engineer - Atlassian
m: +61 407 570 658
___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp


[j-nsp] MX480 MS-MPC-128G CHASSISD_SNMP_TRAP10 jnxFruOfflineReason 8 but no button press

2017-02-08 Thread David B Funk
We have an MX480 with a pair of MS-MPC-128G service boards that are tied together 
as an 'ams' (mams-2 & mams-3) service aggregation for reliability.


Occasionally one of them, for no apparent reason, will go offline and then come 
back online, logging the following in the 'chassisd' log:


CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on (jnxFruContentsIndex 8, 
jnxFruL1Index 4, jnxFruL2Index 1, jnxFruL3Index 0, jnxFruName PIC: MS-MPC-PIC @ 
3/0/*, jnxFruType 11, jnxFruSlot 3, jnxFruOfflineReason 2, jnxFruLastPowerOff 
1052212977, jnxFruLastPowerOn 1052213068)
(as well as a bunch of other stuff).

According to Junos docs, "jnxFruOfflineReason 8" -> "buttonPress(8), -- offlined by 
button press"
But I know that nobody was in the room at the time of those incidents, so the 
button couldn't have been pressed.
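
To compare the reason code across many traps, the chassisd line can be split into its name/value pairs. The following is a minimal Python sketch, not a Juniper tool: `parse_fru_trap`, `offline_reason`, and the two-entry `OFFLINE_REASONS` table are names made up for illustration. Only buttonPress(8) is taken from the docs quoted above; none(2) is an assumption based on the JUNIPER-MIB enumeration and should be checked against the MIB. The sample line is the trap pasted above, which is a "FRU power on" trap carrying reason 2 rather than 8.

```python
import re

# buttonPress(8) is quoted in the post; none(2) is an assumption taken from
# the JUNIPER-MIB enumeration and should be verified against the MIB file.
OFFLINE_REASONS = {2: "none", 8: "buttonPress"}


def parse_fru_trap(line):
    """Split the parenthesised 'name value' pairs of a chassisd FRU trap line."""
    match = re.search(r"\((jnxFru.*)\)", line, re.S)
    if match is None:
        return {}
    fields = {}
    for part in match.group(1).split(","):
        # Collapse whitespace so lines wrapped by the mail client still parse.
        key, _, value = " ".join(part.split()).partition(" ")
        fields[key] = value
    return fields


def offline_reason(fields):
    """Return (numeric code, name) for the jnxFruOfflineReason field."""
    code = int(fields["jnxFruOfflineReason"])
    return code, OFFLINE_REASONS.get(code, "not in this sketch's table")


SAMPLE = (
    "CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on "
    "(jnxFruContentsIndex 8, jnxFruL1Index 4, jnxFruL2Index 1, "
    "jnxFruL3Index 0, jnxFruName PIC: MS-MPC-PIC @ 3/0/*, jnxFruType 11, "
    "jnxFruSlot 3, jnxFruOfflineReason 2, jnxFruLastPowerOff 1052212977, "
    "jnxFruLastPowerOn 1052213068)"
)

fields = parse_fru_trap(SAMPLE)
print(offline_reason(fields))
# Gap between the two counters, in whatever unit the MIB defines for them.
print(int(fields["jnxFruLastPowerOn"]) - int(fields["jnxFruLastPowerOff"]))
```

Running the same parser over the matching power-off traps would show whether every incident really carries reason 8 or only some of them do.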


I hadn't paid too much attention to this, as it was only happening occasionally 
and affected either one board or the other. But today there was a whole spate of 
such incidents (20 in less than 45 minutes), and at one point it took both MPCs 
offline at the same time (thus a noticeable service-interruptus).
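
A spate like this can be flagged automatically once the event times have been pulled out of the log. Below is a rough Python sketch of a sliding-window check. The function name `find_bursts`, its defaults (20 events, 45 minutes, taken from the numbers above), and the timestamps in the example are all hypothetical, since the log excerpts in this post carry no timestamps.

```python
from collections import deque
from datetime import datetime, timedelta


def find_bursts(timestamps, threshold=20, window=timedelta(minutes=45)):
    """Return (first, last) pairs whenever `threshold` or more events
    fall inside a span no longer than `window`."""
    recent = deque()
    bursts = []
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop events that are too old to share a window with `ts`.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            bursts.append((recent[0], ts))
    return bursts


# Hypothetical data: 20 offline events two minutes apart, then quiet days.
start = datetime(2017, 2, 8, 9, 0)
spate = [start + timedelta(minutes=2 * i) for i in range(20)]
quiet = [start + timedelta(days=d) for d in (3, 6, 9)]

print(find_bursts(spate + quiet))
```

The threshold and window are a judgment call; a lower threshold would also catch the occasional single-board flaps described above.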


In the 'messages' log there are lines that correspond:

  /kernel: peer_input_pending_internal:[4506] VKS0 for peer type 22 indx 12 
reported a sb_state 32 = SBS_CANTRCVMORE
  /kernel: peer_inputs:4766 VKS0 closing connection peer type 22 indx 12 err 5
  /kernel: pfe_listener_disconnect: conn dropped: listener idx=7, 
tnpaddr=0x13010080, reason: generic peer error
  datapath-traced[3960]: datapath_traced_connection_event_handler: Disconnected 
from MSPMAND
  mspd[3958]: Removed PIC connection state for fpc=3 pic=0 session=0x827a180
  (FPC Slot 3, PIC Slot 0)  ms30 kernel: svcs_ms2_app_sigcore_exit: sending 
UKERN_ST_DOWN (pid=190, td=0xc291f960, sig=6)
  (FPC Slot 3, PIC Slot 0)  ms30 mspsmd[178]: mspsmd_connection_shutdown: 
Unexpected shutdown of connection, try reconnecting.
  /kernel: if_pfe_services_health_status: Generating Health status (down) msg 
for ifd : ms-3/0/0
  /kernel: if_pfe_services_health_status: Generating health status (down) for 
AMS member mams-3/0/0
  /kernel: if_pfe_ams_process_single_event: ifd:mams-3/0/0, ev = 
AMS_EV_MEMBER_HSTATUS_DOWN agg_state UP, member_state: ACTIVE, 
member_present_count = 2
  /kernel: if_pfe_ams_process_member_down_event:Starting Discard Timer
  /kernel: aggr_link_op: link mams-3/0/0.1 (lidx=1) detached from bundle ams0.1
  /kernel: if_pfe_ams_process_single_event:Done:mams-3/0/0, ev = 
AMS_EV_MEMBER_HSTATUS_DOWN agg_state UP, member_state: DISCARD, 
member_present_count = 2
  /kernel: if_pfe_services_send_lb_options: PEER_BUILD_IPC_SLOT return NULL
  last message repeated 4 times
  mib2d[3969]: SNMP_TRAP_LINK_DOWN: ifIndex 641, ifAdminStatus up(1), 
ifOperStatus down(2), ifName ms-3/0/0.0
  mib2d[3969]: SNMP_TRAP_LINK_DOWN: ifIndex 734, ifAdminStatus up(1), 
ifOperStatus down(2), ifName mams-3/0/0.1
  (FPC Slot 3, PIC Slot 0)  ms30 kernel: msgring_drain_process: bind thread to 
hwtid (5) cpuid(5)
  (FPC Slot 3, PIC Slot 0)  ms30 kernel: Kernel thread "msgdrainthr5" (pid 
21832) exited prematurely.
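
Since the problem hits either one board or the other, it may help to tally the link-down traps per interface and per FPC slot. Here is a small Python sketch, assuming the mib2d line format shown above. `count_link_downs` and `count_by_fpc` are illustrative names, and the sample text is just the two trap lines from this excerpt with the mail wrapping removed.

```python
import re
from collections import Counter

# Matches the mib2d trap format shown above; \s+ and re.S tolerate lines
# that were wrapped by the mail client.
LINK_DOWN = re.compile(
    r"SNMP_TRAP_LINK_DOWN:\s+ifIndex\s+(\d+),.*?ifName\s+(\S+)", re.S
)


def count_link_downs(log_text):
    """Count link-down traps per interface name."""
    return Counter(name for _index, name in LINK_DOWN.findall(log_text))


def count_by_fpc(per_interface):
    """Fold per-interface counts into per-FPC-slot counts (ms-3/0/0.0 -> '3')."""
    per_fpc = Counter()
    for name, count in per_interface.items():
        slot = re.search(r"-(\d+)/", name)
        if slot:
            per_fpc[slot.group(1)] += count
    return per_fpc


SAMPLE = """\
mib2d[3969]: SNMP_TRAP_LINK_DOWN: ifIndex 641, ifAdminStatus up(1), ifOperStatus down(2), ifName ms-3/0/0.0
mib2d[3969]: SNMP_TRAP_LINK_DOWN: ifIndex 734, ifAdminStatus up(1), ifOperStatus down(2), ifName mams-3/0/0.1
"""

per_interface = count_link_downs(SAMPLE)
print(per_interface)
print(count_by_fpc(per_interface))
```

A heavily lopsided per-slot count over a few weeks of logs would point more toward one card than toward a software trigger common to both, though it would not be conclusive either way.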

Usually it runs for days at a time without a single one of these incidents.
So I cannot tell whether I've got flaky hardware or a software bug that is being 
triggered by some external events.


Any suggestions (other than opening a JTAC case)?

--
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_admin    Iowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{