Hi!
I'm using OpenMPI 1.3 on 30 nodes connected with Gigabit Ethernet on Redhat
Linux x86_64.
Our MPI jobs sometimes hang and show the following error log:
[btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv
failed: Connection timed out (110)
I run a test like this: wri
Did "--mca mpi_preconnect_all 1" work?
I also see this "readv failed: Connection timed out" problem in our production
environment. Our engineer reproduced it on 20 nodes with Gigabit Ethernet by
limiting the speed of one Ethernet connection to 2 MB/s and then running an
MPI_Isend && MPI_Recv ring that mea
Hi!
I'm using OpenMPI 1.3 on ten nodes connected with Gigabit Ethernet on Redhat
Linux x86_64.
I ran a test like this: I killed the orted process, and the job hung for a
long time (it hung for 2~3 hours before I killed the job).
I have the following questions:
when network failed or
ner's question: how to avoid having a running MPI job
> hang if a host or the network fails or the orted daemon is killed?
>
> Is there a firewall somewhere?
>
> Jerome
>
> Guanyinzhu wrote:
> > Hi!
> > I'm using OpenMPI 1.3 on ten nodes connected with Gigabit Ethernet on
Let me know how it works for you - it hasn't been extensively tested, but has
worked so far.
Ralph
On Apr 1, 2009, at 6:07 AM, Guanyinzhu wrote:
I mean that I killed the orted daemon process while the MPI job was running, but
the MPI job hung and did not notice that one of its ranks had failed.
the orted "dead" if two
consecutive heartbeats are not seen.
Let me know how it works for you - it hasn't been extensively tested, but has
worked so far.
Ralph
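
For readers following the thread, the heartbeat rule mentioned above (treat a
daemon as dead once two consecutive heartbeats are not seen) can be sketched
roughly as follows. This is only an illustration in Python, not Open MPI's
actual implementation: the class name HeartbeatMonitor, its methods, and the
interval and missed_limit parameters are all hypothetical, and times are plain
numbers in seconds.

```python
class HeartbeatMonitor:
    """Hypothetical sketch: track the last heartbeat time per daemon."""

    def __init__(self, interval, missed_limit=2):
        # interval: expected seconds between heartbeats (assumed value)
        # missed_limit: consecutive missed beats before a daemon is "dead"
        self.interval = interval
        self.missed_limit = missed_limit
        self.last_seen = {}

    def beat(self, daemon, now):
        # Record that a heartbeat from `daemon` arrived at time `now`.
        self.last_seen[daemon] = now

    def dead(self, now):
        # A daemon is reported dead only after more than
        # missed_limit * interval seconds have passed without a beat,
        # i.e. after missed_limit expected beats have failed to arrive.
        limit = self.missed_limit * self.interval
        return sorted(d for d, t in self.last_seen.items() if now - t > limit)


monitor = HeartbeatMonitor(interval=5)
monitor.beat("node1", now=0)
monitor.beat("node2", now=0)
monitor.beat("node2", now=8)   # node2 keeps reporting, node1 goes silent
print(monitor.dead(now=11))
```

With a 5-second interval, node1 has missed the beats expected at 5 and 10
seconds, so it is the only daemon reported; a launcher could then abort the
job instead of letting it hang.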