On Thu, Jul 23, 2009 at 3:03 PM, Ralph Castain <r...@open-mpi.org> wrote:
> It depends on which network fails. If you lose all TCP connectivity, Open
> MPI should abort the job as the out-of-band system will detect the loss of
> connection. If you only lose the MPI connection (whether TCP or some other
> interconnect), then I believe the system will eventually generate an error
> after it retries sending the message a specified number of times, though it
> may not abort.

Thank you Ralph,

From your reply I realized that the question I posted earlier did not describe the problem properly.

I can't use blocking communication routines in my main program ("masterprocess") because any type of network failure may occur (loss of physical connectivity, of TCP connectivity, or of the MPI connection, as you described). So I am using non-blocking point-to-point communication routines and TEST later for completion of the Request.

Once I enter the TEST loop, I keep testing for completion of the Request until TIMEOUT. Suppose TIMEOUT has occurred. In that case I will first check:

1: whether the slave machine is reachable or not. (How can I do that? I have the IP address and host name of the slave machine.)
2: if it is reachable, whether the programs (orted and "slaveprocess") are alive or not.

I don't want to abort my master process in case 1; I would rather wait and hope that the network connection comes back up. Fortunately, Open MPI doesn't abort either process. Both processes can run independently without communicating.

Thanks and Regards,
--
Vipin K.
Research Engineer,
C-DOTB, India
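
For the reachability check in step 1, one possible approach is a non-blocking TCP connect with a timeout, sketched below in C with plain POSIX sockets. This is only an illustration under assumptions not stated in the thread: the helper name tcp_reachable is made up here, and it assumes the slave machine has some TCP port known to be listening (which port is site-specific). It does not use any Open MPI API.

```c
#define _POSIX_C_SOURCE 200112L
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not part of Open MPI).
 * Returns 1 if a TCP connection to host:port completes within timeout_ms,
 * 0 if it is refused, times out, or otherwise fails,
 * and -1 if the host name or address cannot be resolved. */
int tcp_reachable(const char *host, int port, int timeout_ms)
{
    char service[16];
    struct addrinfo hints, *res, *ai;
    int result = 0;

    snprintf(service, sizeof service, "%d", port);
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* accept IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */
    if (getaddrinfo(host, service, &hints, &res) != 0)
        return -1;

    /* Try each resolved address until one connects. */
    for (ai = res; ai != NULL && result == 0; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;

        /* Non-blocking mode, so connect() returns at once instead of hanging. */
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
            result = 1;
        } else if (errno == EINPROGRESS) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int err = 0;
            socklen_t len = sizeof err;

            /* Wait for the handshake to finish, then read its outcome. */
            if (poll(&pfd, 1, timeout_ms) == 1 &&
                getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 &&
                err == 0)
                result = 1;
        }
        close(fd);
    }

    freeaddrinfo(res);
    return result;
}
```

The master could call something like tcp_reachable(slave_ip, port, 2000) after the TEST loop times out. A return of 1 only shows that the machine accepts TCP connections on that port; it says nothing about whether orted or "slaveprocess" is still alive, so step 2 would still need a separate check.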