Ira,

I think our general recommendation is to reboot the machine once the HCA has reported a catastrophic error, since the device is then in a fatal state and won't respond to any command from the host. However, the gen-2 driver (ib_mthca) resets the HCA when it starts, so restarting the driver may serve you just fine (unless you have a persistent HW failure).
From what you reported, IPoIB doesn't seem to survive this, so it looks like you still have to reboot your machine.

Regards,

Boris Shpolyansky
Application Engineer
Mellanox Technologies Inc.
2900 Stender Way
Santa Clara, CA 95054
Tel.: (408) 916 0014
Fax: (408) 970 3403
Cell: (408) 834 9365
www.mellanox.com

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ira Weiny
Sent: Thursday, November 09, 2006 4:45 PM
To: openib-general@openib.org
Cc: Roland Dreier; Trent D'Hooge
Subject: [openib-general] OFED 1.1 IPoIB did not recover after a mthca catas recovery.

We just had an "internal parity error" on a Mellanox HCA. The HCA recovered. However, IPoIB did not fare as well. We are not sure of the details. What I have on the console is:

2006-11-09 15:20:05 ib_mthca 0000:07:00.0: Catastrophic error detected: internal parity error
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[00]: 05000014
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[01]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[02]: 00196240
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[03]: 00126618
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[04]: 00206128
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[05]: 001d6ff8
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[06]: ffffffff
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[07]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[08]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[09]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0a]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0b]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0c]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0d]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0e]: 00000000
2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0f]: 00000000
2006-11-09 15:20:05 divert: no divert_blk to free, ib0 not ethernet
2006-11-09 15:20:05 divert: no divert_blk to free, ib1 not ethernet

ifconfig showed ib0 as "gone" (as in not listed). We tried to ifup ib0 and got:

# zeus64 /root > ifup ib0
ib_ipoib ib_ipoib device ib0 does not seem to be present, delaying initialization.

I then tried to unload the ib_ipoib module, and that has hung for the last 15 min. I have run ibv_rc_pingpong and ib_rdma_bw through the node fine. ibstat, ibstatus, and the switch all show the link to be up. So it appears as though the card recovered fine. What can we do? :-/

Thanks,
Ira

_______________________________________________
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
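
For anyone triaging a similar console log, the ib_mthca lines quoted in this thread follow a regular pattern: one "Catastrophic error detected" line followed by a dump of buf[..] words. Below is a minimal Python sketch for pulling those events out of a saved log. It assumes the line format is exactly what appears in this thread; the function name parse_catas_events and the regular expressions are illustrative only and are not part of OFED or the ib_mthca driver. It does not attempt to interpret the buffer words.

```python
import re

# Header line printed by ib_mthca, in the format seen in this thread (assumed).
CATAS_RE = re.compile(
    r"ib_mthca (?P<pci>[0-9a-f:.]+): Catastrophic error detected: (?P<reason>.+)$"
)
# Buffer dump lines that follow the header line (same assumption).
BUF_RE = re.compile(
    r"ib_mthca (?P<pci>[0-9a-f:.]+): +buf\[(?P<idx>[0-9a-f]{2})\]: (?P<word>[0-9a-f]{8})$"
)


def parse_catas_events(lines):
    """Collect catastrophic-error events and their buffer words.

    Hypothetical helper: returns a list of dicts with the PCI address,
    the reported reason, and a mapping of buffer index to 32-bit word.
    """
    events = []
    for line in lines:
        line = line.rstrip()
        header = CATAS_RE.search(line)
        if header:
            events.append(
                {"pci": header.group("pci"), "reason": header.group("reason"), "buf": {}}
            )
            continue
        buf = BUF_RE.search(line)
        # Attach a buffer word only to the most recent event on the same device.
        if buf and events and events[-1]["pci"] == buf.group("pci"):
            events[-1]["buf"][int(buf.group("idx"), 16)] = int(buf.group("word"), 16)
    return events
```

Feeding it the lines of a saved console log (for example, the result of reading the file with readlines()) yields one entry per catastrophic error; unrelated lines such as the "divert:" messages are ignored.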