On Thu, Jul 23, 2009 at 3:03 PM, Ralph Castain <r...@open-mpi.org> wrote:

> It depends on which network fails. If you lose all TCP connectivity, Open
> MPI should abort the job as the out-of-band system will detect the loss of
> connection. If you only lose the MPI connection (whether TCP or some other
> interconnect), then I believe the system will eventually generate an error
> after it retries sending the message a specified number of times, though it
> may not abort.
>
>
Thank you Ralph,

From your reply I realize that the question I posted earlier did not
describe the problem properly.

I can't use blocking communication routines in my main program
("masterprocess") because any type of network failure may occur (loss of
physical connectivity, TCP connectivity, or the MPI connection, as you
described). So I am using non-blocking point-to-point communication routines
and TEST the Request for completion later. Once I enter the TEST loop, I keep
testing the Request for completion until a TIMEOUT expires. Suppose the
TIMEOUT has occurred. In that case I will first check whether

 1:  The slave machine is reachable. (How can I do that, given that I have
the IP address and host name of the slave machine?)

 2:  If it is reachable, the programs (orted and "slaveprocess") are still
alive.
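
As a rough sketch of the flow described above (poll a request until a
timeout, then probe the slave host), the following is written in plain Python
and is not tied to Open MPI. Everything in it is an assumption for
illustration: the helper names wait_with_timeout and probe_host are
hypothetical, test_fn stands in for whatever call tests the non-blocking
Request, and the TCP port to probe must be one that some service on the slave
is known to listen on. It covers only check 1; it does not tell whether orted
or "slaveprocess" is alive.

```python
import socket
import time


def wait_with_timeout(test_fn, timeout_s, poll_interval_s=0.01):
    """Call test_fn() repeatedly until it returns True or timeout_s elapses.

    test_fn is a placeholder for the request-completion test.
    Returns True on completion, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if test_fn():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


def probe_host(host, port, timeout_s=2.0):
    """Try to open a TCP connection to (host, port).

    Returns:
      "open"        - something accepted the connection
      "refused"     - the host answered, but nothing listens on that port
      "unreachable" - timeout, no route, name lookup failure, etc.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "unreachable"
```

With these helpers the master would call wait_with_timeout first and, only if
it returns False, call probe_host with the slave's IP address. Both "open"
and "refused" suggest the machine itself is up; "unreachable" corresponds to
case 1, where the master keeps running and retries later.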

I don't want to abort my master process in case 1; I hope that the network
connection will come back up later. Fortunately Open MPI doesn't abort any
process. Both processes can run independently without communicating.


Thanks and Regards,
-- 
Vipin K.
Research Engineer,
C-DOTB, India
