Did "--mca mpi_preconnect_all 1" work?
I am also seeing this problem ("readv failed: Connection timed out") in our
production environment. Our engineer reproduced it on 20 nodes with Gigabit
Ethernet by limiting one Ethernet link to 2 MB/s and then running an
MPI_Isend/MPI_Recv ring that
Hi!
I'm using OpenMPI 1.3 on 30 nodes connected with Gigabit Ethernet on Redhat
Linux x86_64.
Our MPI job sometimes hangs and shows the following error log:
[btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv
failed: Connection timed out (110)
I ran a test like this:
ed "dead" if two
consecutive heartbeats are not seen.
Let me know how it works for you - it hasn't been extensively tested, but has
worked so far.
Ralph
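
The rule described above (a daemon is treated as "dead" once two consecutive
heartbeats are not seen) can be sketched roughly as follows. This is only an
illustration under stated assumptions, not Open MPI's implementation: the
hb_monitor struct and the hb_* function names are hypothetical, and times are
plain integer seconds supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one monitored daemon (not an Open MPI type). */
typedef struct {
    long last_beat;  /* time in seconds when the latest heartbeat arrived */
    long interval;   /* expected number of seconds between heartbeats */
} hb_monitor;

/* Record that a heartbeat arrived at time `now`. */
static void hb_record(hb_monitor *m, long now)
{
    m->last_beat = now;
}

/* Count how many whole heartbeat intervals have passed with no heartbeat. */
static long hb_missed(const hb_monitor *m, long now)
{
    assert(m->interval > 0);
    long elapsed = now - m->last_beat;
    return elapsed < 0 ? 0 : elapsed / m->interval;
}

/* Treat the daemon as dead once two consecutive heartbeats were not seen. */
static bool hb_is_dead(const hb_monitor *m, long now)
{
    return hb_missed(m, now) >= 2;
}
```

With a 5-second interval and the last heartbeat at t=0, the daemon is still
considered alive at t=9 and is declared dead from t=10 onward, unless
hb_record() is called in between.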
On Apr 1, 2009, at 6:07 AM, Guanyinzhu wrote:
I mean I killed the orted daemon process while the MPI job was running, but the
MPI job hung and did not notice that one of its ranks had failed.
> Date: Wed, 1 Apr 2
uestion: how to avoid a running MPI job
> hanging if a host or the network fails, or the orted daemon is killed?
>
> Is there a firewall somewhere ?
>
> Jerome
>
> Guanyinzhu wrote:
> > Hi!
> > I'm using OpenMPI 1.3 on ten nodes connected with Gigabit Ethernet on
> > Redhat L
Hi!
I'm using OpenMPI 1.3 on ten nodes connected with Gigabit Ethernet on Redhat
Linux x86_64.
I ran a test like this: I killed the orted process, and the job hung for a
long time (2 to 3 hours, after which I killed the job).
I have the following questions:
when network failed or