Hi Jose,

On 5/21/2010 6:54 AM, José Ignacio Aliaga Estellés wrote:
We have used the lspci -vvxxx and we have obtained:

bi00: 04:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit
Ethernet Controller (Copper) (rev 02)

This is the output for the Intel GigE NIC. You should look at the output for the Myricom NIC and the PCI bridge above it (use lspci -t to see the tree).

bi00: Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-

A PERR- status means no parity error was detected when receiving data. Looking at the PERR status of the PCI bridge on the other side will show if there was any corruption on that bus.
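
If you want to check every device and bridge at once instead of reading the flags by eye, something like the following Python sketch may help. It assumes plain "lspci -vv" output with each Status line on a single line (no hostname prefix such as "bi00:" and no mail wrapping), and the function name find_parity_errors is only illustrative. It reports any "Status" or "Secondary status" line where ParErr or <PERR is set to "+".

```python
import re

# Device header lines start with a bus address such as "04:01.0"
# (optionally preceded by a PCI domain such as "0000:").
DEVICE_RE = re.compile(r"^(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\s", re.I)
# Bridges report "Secondary status" for the bus on their other side.
STATUS_RE = re.compile(r"^\s*(Status|Secondary status):\s*(.*)$")
# A trailing "+" means the flag is set, "-" means it is clear.
FLAG_RE = re.compile(r"(<PERR|ParErr)\+")


def find_parity_errors(lspci_text):
    """Return (device, field, flag) tuples for every parity flag that is set."""
    hits = []
    device = None
    for line in lspci_text.splitlines():
        if DEVICE_RE.match(line):
            device = line.strip()
            continue
        match = STATUS_RE.match(line)
        if match and device is not None:
            for flag in FLAG_RE.findall(match.group(2)):
                hits.append((device, match.group(1), flag + "+"))
    return hits
```

You could feed it the text from subprocess.run(["lspci", "-vv"], capture_output=True, text=True).stdout on each node. An empty list only means no parity flag is currently set; it does not prove the bus is healthy.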

As a first step, see if you can reproduce the errors with a simple test involving a single node at a time. You can run "gm_allsize --verify" on each machine: it will send packets to itself (loopback in the switch) and check for corruption. If you don't see errors after a while, that node is probably clean. If you do see errors, look deeper at the lspci output to see if it's a PCI problem. If you are using a riser card, try running without it.
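
To run that loopback test across the cluster without logging in to each node by hand, a small Python wrapper along these lines might be enough. This is only a sketch under several assumptions: passwordless ssh to each node, gm_allsize in the remote PATH, and the GNU coreutils "timeout" command available on the nodes to stop the test after a fixed time. The helper names and the host list are made up for illustration, and the script does not try to interpret the gm_allsize output, so you still need to read it yourself.

```python
import subprocess


def build_command(host, seconds):
    # Assumption: GNU `timeout` exists on the remote node; it ends the
    # test after `seconds` and then exits with status 124.
    return ["ssh", host, "timeout", str(seconds), "gm_allsize", "--verify"]


def run_loopback_test(host, seconds=60, runner=subprocess.run):
    """Run the test on one node and return (host, exit status, captured output)."""
    result = runner(build_command(host, seconds), capture_output=True, text=True)
    return host, result.returncode, result.stdout + result.stderr


def run_on_all(hosts, seconds=60):
    # One node at a time, so a failure points at a single machine.
    return [run_loopback_test(host, seconds) for host in hosts]
```

For example, run_on_all(["bi00", "bi01"], seconds=300) would test two nodes for five minutes each (the second host name is hypothetical); print the captured output per host and look for any corruption reports.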

I am not sure if Open MPI has an option to enable a debug checksum, but if it does, it would also be useful to see whether it detects anything.

Additionally, if you know any software tool or methodology to check the
hardware/software, please, could you send us how to do it?

You may want to look at the FAQ on GM troubleshooting:
http://www.myri.com/cgi-bin/fom.pl?file=425

Additionally, you can send email to h...@myri.com to open a ticket.

Patrick
