Re: [OMPI users] 'readv failed: Connection timed out' issue

2010-05-10 Thread Guanyinzhu
Did "--mca mpi_preconnect_all 1" work? I am also facing this "readv failed: Connection timed out" problem in our production environment. Our engineer has reproduced it on 20 nodes with Gigabit Ethernet by limiting the speed of one Ethernet link to 2MB/s and then running an MPI_Isend && MPI_Recv ring that

[OMPI users] MPI_Recv hang because readv failed at mca_btl_tcp_frag_recv()

2010-05-05 Thread Guanyinzhu
Hi! I'm using OpenMPI 1.3 on 30 nodes connected with Gigabit Ethernet on Red Hat Linux x86_64. Our MPI job sometimes hangs and shows the following error log: [btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110). I ran a test like this:

Re: [OMPI users] Beginner's question: how to avoid a running mpi job hang if host or network failed or orted deamon killed?

2009-04-02 Thread Guanyinzhu
ed "dead" if two consecutive heartbeats are not seen. Let me know how it works for you - it hasn't been extensively tested, but has worked so far. Ralph On Apr 1, 2009, at 6:07 AM, Guanyinzhu wrote: I mean I killed the orted daemon process while the MPI job was running, but the MPI j

Re: [OMPI users] Beginner's question: how to avoid a running mpi job hang if host or network failed or orted deamon killed?

2009-04-02 Thread Guanyinzhu
how it works for you - it hasn't been extensively tested, but has worked so far. Ralph On Apr 1, 2009, at 6:07 AM, Guanyinzhu wrote: I mean I killed the orted daemon process while the MPI job was running, but the MPI job hung and did not notice that one of its ranks had failed. > Date: Wed, 1 Apr 2

Re: [OMPI users] Beginner's question: how to avoid a running mpi job hang if host or network failed or orted deamon killed?

2009-04-01 Thread Guanyinzhu
uestion: how to avoid a running mpi job > hang if host or network failed or orted deamon killed? > > Is there a firewall somewhere ? > > Jerome > > Guanyinzhu wrote: > > Hi! > > I'm using OpenMPI 1.3 on ten nodes connected with Gigabit Ethernet on > > Redhat L

[OMPI users] Beginner's question: how to avoid a running mpi job hang if host or network failed or orted deamon killed?

2009-04-01 Thread Guanyinzhu
Hi! I'm using OpenMPI 1.3 on ten nodes connected with Gigabit Ethernet on Red Hat Linux x86_64. I ran a test like this: I just killed the orted process, and the job hung for a long time (it hung for 2~3 hours before I killed the job). I have the following questions: when network failed or