Ram,

What is the name and version of the kernel module for your NIC? I have experienced something similar with my tg3 module, although the error that appeared for me was different:

[btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: No route to host (113)

I solved it by turning off TCP segmentation offload (TSO) on the interface with ethtool:

/sbin/ethtool -K eth0 tso off
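In case it helps, ethtool can also report the driver name and version (which answers my question above) and list the current offload settings before you change anything; the interface name eth0 below is just an example:

/sbin/ethtool -i eth0   # driver name, version and firmware of the NIC module
/sbin/ethtool -k eth0   # current offload settings, including TSO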

Leonardo


Aurélien Bouteiller wrote:
If you have several network cards in your system, it can sometimes get the endpoints confused, especially if the nodes don't have the same number of cards or don't use the same subnet for all of "eth0, eth1". You should try to restrict Open MPI to a single network by passing the --mca btl_tcp_if_include ethx parameter to mpirun, where ethx is the interface that is connected to the same logical and physical network on every machine.
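A minimal sketch of such a command line, assuming the cluster's private network sits on eth0 and ./a.out stands in for the real application:

mpirun --bynode -np 2 --mca btl_tcp_if_include eth0 ./a.out

You can also exclude interfaces instead (for example --mca btl_tcp_if_exclude lo,eth1) if that is easier to express on your nodes.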

Aurelien

On Oct 1, 2008, at 11:47, V. Ram wrote:

I wrote earlier about one of my users running a third-party Fortran code
on 32-bit x86 machines, using OMPI 1.2.7, which is exhibiting some odd
crash behavior.

Our cluster's nodes all have 2 single-core processors.  If this code is
run on 2 processors on 1 node, it runs seemingly fine.  However, if the
job runs on 1 processor on each of 2 nodes (e.g., mpirun --bynode), then
it crashes and gives messages like:

[node4][0,1,4][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
[node3][0,1,3][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=110
mca_btl_tcp_frag_recv: readv failed with errno=104

Essentially, if any network communication is involved, the job crashes
in this form.
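
For what it is worth, those errno values can be decoded on any Linux box: 110 is ETIMEDOUT (connection timed out) and 104 is ECONNRESET (connection reset by peer). A quick check, using Python only as a convenient lookup table:

python -c 'import errno, os; print(errno.errorcode[110] + ": " + os.strerror(110)); print(errno.errorcode[104] + ": " + os.strerror(104))'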

I do have another user who runs his own MPI code on 10+ of these
processors for days at a time without issue, so I don't think it's
hardware.

The original code also runs fine across many networked nodes if the
architecture is x86-64 (also running OMPI 1.2.7).

We have also tried different Fortran compilers (both PathScale and
gfortran) and keep getting these crashes.

Are there any suggestions on how to figure out if it's a problem with
the code or the OMPI installation/software on the system? We have tried
"--debug-daemons", which revealed no new or interesting information.
Is there a way to trap segfault messages or more detailed MPI
transaction information or anything else that could help diagnose this?

Thanks.
--
 V. Ram
 v_r_...@fastmail.fm


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edificio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
