Re: [OMPI devel] How can I achieve node fail over

2010-01-12 Thread Ralph Castain
When the link fails, mpirun loses contact with the orted on that node. This causes the OOB to call back into the routed framework to see whether this is a critical link. Since a link to a daemon -is- considered critical, a call is made to the errmgr framework indicating that a proc (in this case, a daem
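
The chain described in this message (lost connection, OOB callback, routed check, errmgr notification) can be pictured with a small toy model. This is only a sketch in Python, not Open MPI code: the class names `OOB`, `Routed`, and `ErrMgr`, their methods, and the `simulate` helper are hypothetical stand-ins chosen for illustration. It stops at the point where the errmgr is notified, since the message above is cut off before saying what happens next.

```python
# Toy model only: these class and method names are hypothetical and do not
# correspond to real Open MPI symbols.

class ErrMgr:
    """Stands in for the errmgr framework: records reported failures."""

    def __init__(self):
        self.reports = []

    def proc_failed(self, name):
        # What the real errmgr does next is not covered by the message.
        self.reports.append(name)


class Routed:
    """Stands in for the routed framework: decides whether a link is critical."""

    def __init__(self, errmgr, daemons):
        self.errmgr = errmgr
        self.daemons = set(daemons)

    def is_critical(self, peer):
        # Per the message, a link to a daemon is considered critical.
        return peer in self.daemons

    def connection_lost(self, peer):
        critical = self.is_critical(peer)
        if critical:
            self.errmgr.proc_failed(peer)
        return critical


class OOB:
    """Stands in for the out-of-band messaging layer used by mpirun."""

    def __init__(self, routed):
        self.routed = routed

    def link_failed(self, peer):
        # Losing contact triggers a callback into the routed layer.
        return self.routed.connection_lost(peer)


def simulate(daemons, failed_peer):
    """Run one link failure through the chain and return the errmgr reports."""
    errmgr = ErrMgr()
    oob = OOB(Routed(errmgr, daemons))
    oob.link_failed(failed_peer)
    return errmgr.reports
```

Calling `simulate(["orted-node1", "orted-node2"], "orted-node1")` reports the lost daemon to the toy errmgr, while a failure of a peer that is not in the daemon list is not reported.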

Re: [OMPI devel] How can I achieve node fail over

2010-01-12 Thread Sai Sudheesh
Hi, I want to use OpenMPI in a context where link failure is highly probable. My intention is both...I also want to get an in-depth understanding of the code to know what happens behind the scenes. Does anybody have suggestions or methodologies to follow

Re: [OMPI devel] How can I achieve node fail over

2010-01-11 Thread Ralph Castain
As Josh indicated, the current OMPI trunk won't do that at the moment. Josh and I are working on a side branch to integrate the OpenRCM methods with mpirun to provide an OMPI capability for those not running ORCM on their systems. What wasn't clear is your motivation. Are you trying to develop t

Re: [OMPI devel] How can I achieve node fail over

2010-01-11 Thread Sai Sudheesh
Hi Josh, First of all, thanks for your response. There were some typos in my mail that made parts of it vague. Let me elaborate on the scenarios mentioned in the previous mail. What I tried is as follows.

Re: [OMPI devel] How can I achieve node fail over

2010-01-11 Thread Josh Hursey
On Jan 6, 2010, at 9:04 AM, Sai Sudheesh wrote: Hi, Just about two months ago I started experimenting with OpenMPI. I found this piece of software very interesting. How can I make this software fault tolerant? Depends on what you mean by fault tolerant. :) As of no

[OMPI devel] How can I achieve node fail over

2010-01-06 Thread Sai Sudheesh
Hi, Just about two months ago I started experimenting with OpenMPI. I found this piece of software very interesting. How can I make this software fault tolerant? As of now I am running this software on two machines with quad-core processors and Fedora 10.