Thank you very much!


The option -mca orte_heartbeat_rate N is very useful for detecting failures 
in a running MPI job, such as a failed host or network, or a killed orted 
daemon.



I have another question:

I use ssh for Open MPI remote launch, but sometimes a host does not answer ssh 
login requests even though it still answers ping, possibly because of an OS 
problem. If such a faulty host is in the hostfile, the "mpirun -hostfile..." 
command hangs even when I set -mca orte_heartbeat_rate 5. Are there any other 
options to avoid this? 
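
One possible workaround, offered only as a sketch and not as an Open MPI 
option confirmed in this thread, is to probe every host with a time-limited, 
non-interactive ssh login before calling mpirun, and to launch with a filtered 
hostfile. The function names (probe_host, filter_hostfile), the 5 and 10 second 
limits, and the file names are assumptions made for illustration; the sketch 
also assumes the `timeout` command from GNU coreutils is installed.

```shell
#!/bin/sh
# Pre-flight check (illustrative sketch, not part of Open MPI).

# probe_host HOST: succeed only if a non-interactive ssh login to HOST
# completes within the time limit. BatchMode stops ssh from prompting,
# ConnectTimeout bounds the connection attempt, and the outer `timeout`
# bounds the whole attempt in case sshd accepts the connection but stalls.
probe_host() {
    timeout 10 ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# filter_hostfile IN OUT: copy to OUT every hostfile line of IN whose first
# field names a host that passes probe_host. Blank lines and lines starting
# with '#' are dropped.
filter_hostfile() {
    : > "$2"
    while read -r host rest || [ -n "$host" ]; do
        case "$host" in
            ''|'#'*) continue ;;
        esac
        # stdin is redirected so that ssh cannot consume the hostfile.
        if probe_host "$host" </dev/null; then
            printf '%s\n' "${host}${rest:+ $rest}" >> "$2"
        else
            echo "skipping unreachable host: $host" >&2
        fi
    done < "$1"
}
```

It could then be used like this (hosts, hosts.ok and ./my_app are placeholder 
names):

    filter_hostfile hosts hosts.ok
    mpirun -hostfile hosts.ok -mca orte_heartbeat_rate 5 ./my_app

A host can still stop answering ssh between the check and the launch, so this 
only narrows the window; it does not remove the hang entirely.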





Thanks a lot!



From: r...@lanl.gov
To: us...@open-mpi.org
List-Post: users@lists.open-mpi.org
Date: Wed, 1 Apr 2009 07:34:46 -0600
Subject: Re: [OMPI users] Beginner's question: how to avoid a running mpi job 
hang if host or network failed or orted deamon killed?

There is indeed a heartbeat mechanism you can use - it is "off" by default. You 
can set it to check every N seconds with:


-mca orte_heartbeat_rate N


on your command line. Or if you want it to always run, add "orte_heartbeat_rate 
= N" to your default MCA param file. OMPI will declare the orted "dead" if two 
consecutive heartbeats are not seen.
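
As a rough illustration of the second approach, the sketch below writes the 
setting into an MCA parameter file. The helper name set_heartbeat is made up 
for this example, and the per-user file location used in the usage line 
($HOME/.openmpi/mca-params.conf) is an assumption about a typical 
installation; check where your installation reads its default parameters.

```shell
#!/bin/sh
# set_heartbeat FILE N: store "orte_heartbeat_rate = N" in the MCA parameter
# file FILE, replacing any earlier setting of the same parameter and leaving
# all other lines untouched. (Hypothetical helper, not shipped with Open MPI.)
set_heartbeat() {
    file=$1
    rate=$2
    mkdir -p "$(dirname "$file")"
    touch "$file"
    # grep -v exits non-zero when nothing is left to print; that is fine here.
    grep -v '^[[:space:]]*orte_heartbeat_rate[[:space:]]*=' "$file" > "$file.tmp" || true
    echo "orte_heartbeat_rate = $rate" >> "$file.tmp"
    mv "$file.tmp" "$file"
}
```

For example, to check every 5 seconds on all future runs:

    set_heartbeat "$HOME/.openmpi/mca-params.conf" 5
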


Let me know how it works for you - it hasn't been extensively tested, but has 
worked so far.
Ralph



On Apr 1, 2009, at 6:07 AM, Guanyinzhu wrote:

I mean that I killed the orted daemon process while the MPI job was running, 
but the MPI job hung and did not notice that one of its ranks had failed.




> Date: Wed, 1 Apr 2009 19:09:34 +0800
> From: ml.jgmben...@mailsnare.net
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Beginner's question: how to avoid a running mpi job 
> hang if host or network failed or orted deamon killed?
> 
> Is there a firewall somewhere?
> 
> Jerome
> 
> Guanyinzhu wrote:
> > Hi! 
> > I'm using OpenMPI 1.3 on ten nodes connected with Gigabit Ethernet on 
> > Redhat Linux x86_64. 
> > 
> > I ran a test like this: I killed the orted process, and the job hung 
> > for a long time (2 to 3 hours, after which I killed the job).
> > 
> > I have the following questions:
> > 
> > When the network or a host fails, or the orted daemon is killed by 
> > accident, how long does it take the running MPI job to notice and exit? 
> > 
> > Does OpenMPI support a heartbeat mechanism, or how else could I quickly 
> > detect the failure so that the MPI job does not hang?
> > 
> > 
> > thanks a lot!
> > 
> > 
> > 
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users


