Pawel Dziekonski wrote:
Hi,

from time to time I get catastrophic errors like the ones below. The software stack is kernel 2.6.18-92.1.10.el5 with the Lustre client. Device and OFED info is also below. Any hints?

Thanks in advance,
Pawel

06:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)

# ibv_devices
    device              node GUID
    ------              ----------------
    mthca0              0030487e07700000

# ibv_devinfo
hca_id: mthca0
        fw_ver:                 1.2.0
        node_guid:              0030:487e:0770:0000
        sys_image_guid:         0030:487e:0770:0003
        vendor_id:              0x02c9
        vendor_part_id:         25204
        hw_ver:                 0xA0
        board_id:               SM_0000000003
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       441
                        port_lmc:       0x00

kernel: ib_mthca 0000:06:00.0: Catastrophic error detected: unknown error
kernel: ib_mthca 0000:06:00.0: buf[00]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[01]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[02]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[03]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[04]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[05]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[06]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[07]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[08]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[09]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0a]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0b]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0c]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0d]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0e]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0f]: ffffffff
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)
kernel: ib0: ib_detach_mcast failed (result = -11)
kernel: ib0: ipoib_mcast_detach failed (result = -11)
kernel: ib0: ib_detach_mcast failed (result = -11)
kernel: ib0: ipoib_mcast_detach failed (result = -11)
kernel: ib0: Failed to modify QP to ERROR state
kernel: ib0: timing out; 0 sends 128 receives not completed
kernel: ib0: Failed to modify QP to RESET state
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_CQ failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_SRQ failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)

kernel: ib_mthca 0000:01:00.0: Catastrophic error detected: internal parity error
kernel: ib_mthca 0000:01:00.0: buf[00]: 05000000
kernel: ib_mthca 0000:01:00.0: buf[01]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[02]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[03]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[04]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[05]: 00127f2c
kernel: ib_mthca 0000:01:00.0: buf[06]: 000a0056
kernel: ib_mthca 0000:01:00.0: buf[07]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[08]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[09]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0a]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0b]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0c]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0d]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0e]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0f]: 00000000
kernel: ib0: ib_query_port failed
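The two dumps differ in their first word: buf[00] is 05000000 on the 0000:01:00.0 HCA (reported as "internal parity error") and ffffffff on the 0000:06:00.0 HCA (reported as "unknown error"). A minimal Python sketch of how the reported type can be read off buf[00], assuming the driver uses the most significant byte of that word as the error type. The type-code table comes from our reading of the ib_mthca driver source and is an assumption, not checked against the exact el5/OFED build in this report; `CATAS_TYPES`, `decode_catas_type` and `all_ones` are hypothetical names for illustration only.

```python
from typing import Sequence

# Hypothetical table: assumed ib_mthca catastrophic-error type codes
# (our reading of the driver source, not verified against this exact build).
CATAS_TYPES = {
    0x0: "internal error",
    0x3: "uplink bus error",
    0x4: "DDR data error",
    0x5: "internal parity error",
}


def decode_catas_type(buf00: str) -> str:
    """Map the hex word logged as buf[00] to the driver's error string."""
    word = int(buf00, 16)
    err_type = word >> 24  # most significant byte of the 32-bit word
    return CATAS_TYPES.get(err_type, "unknown error")


def all_ones(words: Sequence[str]) -> bool:
    """True if every logged word is 0xffffffff, a pattern often seen when
    reads go to a PCI device that has stopped responding."""
    return all(int(w, 16) == 0xFFFFFFFF for w in words)


if __name__ == "__main__":
    # buf[00] values copied from the two dumps above
    print(decode_catas_type("05000000"))  # internal parity error
    print(decode_catas_type("ffffffff"))  # unknown error
    print(all_ones(["ffffffff"] * 16))    # True
```

Under this assumption, the "unknown error" on 0000:06:00.0 is not a distinct error code at all: the whole buffer reads back as all ones, which points at the HCA no longer answering reads rather than at a specific internal fault.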
This is a known issue with InfiniHost III HCA firmware 1.2.0. Please contact Mellanox support to get an updated firmware version.

Tziporet

_______________________________________________
general mailing list
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
