We have an MX480 with a pair of MS-MPC-128G service boards that are tied together as an 'ams' (mams-2 & mams-3) service aggregation for reliability.

Occasionally one of them, for no apparent reason, will go offline and then come back online, logging this in the 'chassisd' log:

CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on (jnxFruContentsIndex 8, 
jnxFruL1Index 4, jnxFruL2Index 1, jnxFruL3Index 0, jnxFruName PIC: MS-MPC-PIC @ 
3/0/*, jnxFruType 11, jnxFruSlot 3, jnxFruOfflineReason 2, jnxFruLastPowerOff 
1052212977, jnxFruLastPowerOn 1052213068)
(as well as a bunch of other stuff).

According to Junos docs, "jnxFruOfflineReason 8" -> "buttonPress(8), -- offlined by 
button press"
But I know that nobody was in the room at the time of those incidents, so the button couldn't have been pressed.
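To make it easier to check which reason code each trap actually carries, here is a minimal Python sketch that pulls the numeric jnxFru* fields out of a chassisd trap line. The trap text is the one quoted above with the wrapped lines rejoined; the function name and the one-entry reason table are my own illustration, and only the value 8 is taken from the doc quote. Note that this particular "FRU power on" trap carries jnxFruOfflineReason 2, so the reason-8 value presumably comes from a different trap among the "other stuff".

```python
import re

# The trap as quoted above, with the mail-wrapped lines joined back together.
TRAP = (
    "CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on "
    "(jnxFruContentsIndex 8, jnxFruL1Index 4, jnxFruL2Index 1, "
    "jnxFruL3Index 0, jnxFruName PIC: MS-MPC-PIC @ 3/0/*, jnxFruType 11, "
    "jnxFruSlot 3, jnxFruOfflineReason 2, jnxFruLastPowerOff 1052212977, "
    "jnxFruLastPowerOn 1052213068)"
)

# Only the entry quoted from the docs; fill in the rest from the Juniper MIB.
OFFLINE_REASONS = {8: "buttonPress"}


def parse_fru_trap(line):
    """Return the numeric jnxFru* fields of a FRU trap line as a dict.

    Non-numeric fields such as jnxFruName are skipped.
    """
    pairs = re.findall(r"(jnxFru\w+) (\d+)(?=[,)])", line)
    return {name: int(value) for name, value in pairs}


fields = parse_fru_trap(TRAP)
reason = fields["jnxFruOfflineReason"]
# Raw difference between the two timestamps; the unit is not stated in the
# log line, so it is left uninterpreted here.
delta = fields["jnxFruLastPowerOn"] - fields["jnxFruLastPowerOff"]

print("slot", fields["jnxFruSlot"],
      "offline reason", reason, OFFLINE_REASONS.get(reason, "(not in table)"),
      "on-off delta", delta)
```

Running the same function over every FRU trap in the chassisd log would show whether the offline traps really all report the same reason code.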

I hadn't paid too much attention to this, as it was only happening occasionally and affected either one board or the other. But today there was a whole spate of such incidents (20 in less than 45 minutes), and at one point it took both MPCs offline at the same time (thus a noticeable service-interruptus).

In the 'messages' log there are lines that correspond:

  /kernel: peer_input_pending_internal:[4506] VKS0 for peer type 22 indx 12 
reported a sb_state 32 = SBS_CANTRCVMORE
  /kernel: peer_inputs:4766 VKS0 closing connection peer type 22 indx 12 err 5
  /kernel: pfe_listener_disconnect: conn dropped: listener idx=7, 
tnpaddr=0x13010080, reason: generic peer error
  datapath-traced[3960]: datapath_traced_connection_event_handler: Disconnected 
from MSPMAND
  mspd[3958]: Removed PIC connection state for fpc=3 pic=0 session=0x827a180
  (FPC Slot 3, PIC Slot 0)  ms30 kernel: svcs_ms2_app_sigcore_exit: sending 
UKERN_ST_DOWN (pid=190, td=0xc00000000291f960, sig=6)
  (FPC Slot 3, PIC Slot 0)  ms30 mspsmd[178]: mspsmd_connection_shutdown: 
Unexpected shutdown of connection, try reconnecting.
  /kernel: if_pfe_services_health_status: Generating Health status (down) msg 
for ifd : ms-3/0/0
  /kernel: if_pfe_services_health_status: Generating health status (down) for 
AMS member mams-3/0/0
  /kernel: if_pfe_ams_process_single_event: ifd:mams-3/0/0, ev = 
AMS_EV_MEMBER_HSTATUS_DOWN agg_state UP, member_state: ACTIVE, 
member_present_count = 2
  /kernel: if_pfe_ams_process_member_down_event:Starting Discard Timer
  /kernel: aggr_link_op: link mams-3/0/0.1 (lidx=1) detached from bundle ams0.1
  /kernel: if_pfe_ams_process_single_event:Done:mams-3/0/0, ev = 
AMS_EV_MEMBER_HSTATUS_DOWN agg_state UP, member_state: DISCARD, 
member_present_count = 2
  /kernel: if_pfe_services_send_lb_options: PEER_BUILD_IPC_SLOT return NULL
  last message repeated 4 times
  mib2d[3969]: SNMP_TRAP_LINK_DOWN: ifIndex 641, ifAdminStatus up(1), 
ifOperStatus down(2), ifName ms-3/0/0.0
  mib2d[3969]: SNMP_TRAP_LINK_DOWN: ifIndex 734, ifAdminStatus up(1), 
ifOperStatus down(2), ifName mams-3/0/0.1
  (FPC Slot 3, PIC Slot 0)  ms30 kernel: msgring_drain_process: bind thread to 
hwtid (5) cpuid(5)
  (FPC Slot 3, PIC Slot 0)  ms30 kernel: Kernel thread "msgdrainthr5" (pid 
21832) exited prematurely.

Usually it runs for days at a time without a single one of these incidents.
So I cannot tell if I've got flaky hardware or a software bug that is being triggered by some external event.
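One thing that might help separate the two cases is tallying the member-down events per board and flagging the times both boards dropped close together. Below is a rough Python sketch of that idea. It assumes the 'messages' lines carry a standard syslog "Mon DD HH:MM:SS" prefix (stripped from the excerpt above); the sample timestamps, the hostname "re0", the member name mams-2/0/0 and the 60-second gap are all made up for illustration. Both boards failing in tight clusters would hint at a common trigger rather than a single bad board, though it would not prove it.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Matches the once-per-incident "health status (down) for AMS member" line.
DOWN_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s.*"
    r"Generating health status \(down\) for\s+AMS member (\S+)"
)


def down_events(lines, year=2000):
    """Return (timestamp, member) for each AMS member-down line.

    Syslog omits the year, so one is supplied; 2000 is a leap year, which
    lets a Feb 29 entry parse.
    """
    events = []
    for line in lines:
        m = DOWN_RE.search(line)
        if m:
            stamp = " ".join(m.group(1).split())  # collapse padded day
            ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            events.append((ts, m.group(2)))
    return events


def near_simultaneous(events, gap=timedelta(seconds=60)):
    """Return consecutive events on different members at most `gap` apart."""
    ordered = sorted(events)
    return [(a, b) for a, b in zip(ordered, ordered[1:])
            if a[1] != b[1] and b[0] - a[0] <= gap]


# Hypothetical sample lines; real input would be the 'messages' file.
PREFIX = "re0 /kernel: if_pfe_services_health_status: "
SAMPLE = [
    "Jan  5 10:00:01 " + PREFIX + "Generating health status (down) for AMS member mams-3/0/0",
    "Jan  5 10:00:31 " + PREFIX + "Generating health status (down) for AMS member mams-2/0/0",
    "Jan  5 10:20:00 " + PREFIX + "Generating health status (down) for AMS member mams-3/0/0",
    "Jan  5 10:20:01 re0 mib2d[3969]: SNMP_TRAP_LINK_DOWN: ifIndex 641",
]

events = down_events(SAMPLE)
per_member = Counter(member for _, member in events)
overlaps = near_simultaneous(events)

print(dict(per_member))
for first, second in overlaps:
    print("both down near", first[0], first[1], "and", second[1])
```
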

Any suggestions? (Other than opening a JTAC case.)

--
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{