OK,
Having confirmed that replacing MPI_Abort() with exit() does not work, and 
that under these conditions the only process left running appears to be 
mpirun, I think I need to report a bug, i.e.:
Although the processes themselves can be stopped (by exit() if nothing else), 
mpirun hangs after a node is powered off and can never exit, as it appears to 
wait indefinitely for the missing node to receive or send a signal.
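
For reference, a minimal sketch of the pattern in question: immediate 
send/receive around a ring, a finite polling loop, and MPI_Abort() on timeout 
(with exit() as the alternative that was also tried).  The ring exchange, tag 
0, and the 10-second limit are made up for illustration; this is not the 
actual ingsprinkle code:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, flag = 0;
    int outbuf, inbuf = -1;
    MPI_Request reqs[2];
    double start;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    outbuf = rank;

    /* Immediate (non-blocking) send/receive around a ring, so the
       process is never stuck inside a blocking MPI call. */
    MPI_Isend(&outbuf, 1, MPI_INT, (rank + 1) % size, 0,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&inbuf, 1, MPI_INT, (rank + size - 1) % size, 0,
              MPI_COMM_WORLD, &reqs[1]);

    /* Poll for completion for a finite time instead of calling
       MPI_Wait, so a dead peer shows up as elapsed wall-clock time. */
    start = MPI_Wtime();
    while (!flag) {
        MPI_Testall(2, reqs, &flag, MPI_STATUSES_IGNORE);
        if (!flag && MPI_Wtime() - start > 10.0) {  /* 10 s: illustrative */
            fprintf(stderr, "rank %d: transfer timed out\n", rank);
            /* MPI_Abort should kill the whole job; exit() was tried as
               an alternative.  Either way the local processes die, but
               mpirun itself still hangs once a node has been powered
               off. */
            MPI_Abort(MPI_COMM_WORLD, 5);
            exit(5);  /* not reached if MPI_Abort succeeds */
        }
    }

    MPI_Finalize();
    return 0;
}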


--- On Wed, 23/6/10, Jeff Squyres <jsquy...@cisco.com> wrote:

From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun
To: "Open MPI Users" <us...@open-mpi.org>
Received: Wednesday, 23 June, 2010, 9:10 PM

Open MPI's fault tolerance support is fairly rudimentary.  If you kill any 
process without calling MPI_Finalize, Open MPI will -- by default -- kill all 
the others in the job.

Various research work is ongoing to improve fault tolerance in Open MPI, but I 
don't know the state of it in terms of surviving a failed process.  I *think* 
that this kind of stuff is not ready for prime time, but I admit that this is 
not an area that I pay close attention to.



On Jun 23, 2010, at 3:08 AM, Randolph Pullen wrote:

> That is effectively what I have done: I changed to immediate send/receive 
> calls and wait in a loop a finite number of times for the transfers to 
> complete, calling MPI_Abort() if they do not complete within a set time.
> It is not clear how I can kill mpirun in a manner consistent with the API.
> Are you implying I should call exit() rather than MPI_Abort()?
> 
> --- On Wed, 23/6/10, David Zhang <solarbik...@gmail.com> wrote:
> 
> From: David Zhang <solarbik...@gmail.com>
> Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun
> To: "Open MPI Users" <us...@open-mpi.org>
> Received: Wednesday, 23 June, 2010, 4:37 PM
> 
> Since you turned the machine off instead of just killing one of the 
> processes, no signals could be sent to the other processes.  Perhaps you 
> could institute some sort of handshaking in your software that periodically 
> checks for the attendance of all machines, and times out if not all are 
> present within some allotted time?
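
A rough sketch of that attendance check, for concreteness.  The tag 99, the 
5-second deadline, and the attendance_check() name are all made up for 
illustration, and it assumes rank 0 survives to do the checking:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if every rank checked in before the deadline, 0 otherwise.
   Only rank 0 can actually detect absentees; workers just send a
   heartbeat and carry on. */
static int attendance_check(MPI_Comm comm, double deadline_sec)
{
    int rank, size, i, done = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank != 0) {                 /* workers: report in to rank 0 */
        MPI_Send(&rank, 1, MPI_INT, 0, 99, comm);
        return 1;
    }

    if (size > 1) {
        /* rank 0: one non-blocking receive per peer, polled against a
           wall-clock deadline rather than a blocking MPI_Waitall. */
        MPI_Request *reqs = malloc((size - 1) * sizeof(MPI_Request));
        int *ids = malloc((size - 1) * sizeof(int));
        double start = MPI_Wtime();

        for (i = 1; i < size; i++)
            MPI_Irecv(&ids[i - 1], 1, MPI_INT, i, 99, comm, &reqs[i - 1]);

        done = 0;
        while (!done && MPI_Wtime() - start < deadline_sec)
            MPI_Testall(size - 1, reqs, &done, MPI_STATUSES_IGNORE);

        if (!done)                   /* cancel heartbeats that never came */
            for (i = 0; i < size - 1; i++)
                if (reqs[i] != MPI_REQUEST_NULL) {
                    MPI_Cancel(&reqs[i]);
                    MPI_Wait(&reqs[i], MPI_STATUS_IGNORE);
                }
        free(reqs);
        free(ids);
    }
    return done;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    if (!attendance_check(MPI_COMM_WORLD, 5.0)) {  /* 5 s: illustrative */
        fprintf(stderr, "not all ranks present, aborting job\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Finalize();
    return 0;
}

Note that if rank 0 itself is the machine that is switched off, the workers' 
blocking MPI_Send would hang instead, so a real version needs timeouts on 
both sides; and, as described above, mpirun may still refuse to exit after 
the MPI_Abort.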
> 
> On Tue, Jun 22, 2010 at 10:43 PM, Randolph Pullen 
> <randolph_pul...@yahoo.com.au> wrote:
> 
> I have an MPI program that aggregates data from multiple SQL systems.  It 
> all runs fine.  To test fault tolerance I switch one of the machines off 
> while it is running.  The result is always a hang, i.e. mpirun never 
> completes.
>  
> To try to avoid this I have replaced the send and receive calls with 
> immediate calls (i.e. MPI_Isend, MPI_Irecv) to try to trap long-waiting 
> sends and receives, but it makes no difference.
> My requirement is that either all processes complete or mpirun exits with 
> an error, no matter where they are in their execution when a failure 
> occurs.  This system must continue (i.e. fail the current run) if a machine 
> dies, then regroup and re-cast the job over the remaining nodes.
> 
> I am running FC10, gcc 4.3.2 and Open MPI 1.4.1 on dual-core Intel x86_64 
> machines with 4 GB RAM.
> 
> 
> ===============================================================================================================
> The commands I have tried:
> mpirun -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"
> 
> mpirun -mca btl ^sm -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"
> 
> mpirun -mca orte_forward_job_control 1 -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"
> 
> 
> ===============================================================================================================
> 
> The results:
> recv returned 0 with status 0
> waited  # 2000002 tiumes - now status is  0 flag is -1976147192
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
> with errorcode 5.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 29141 on
> node bd01 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> 
> [*** wait a long time ***]
> [bd01:29136] [[55293,0],0]-[[55293,0],1] mca_oob_tcp_msg_recv: readv failed: 
> Connection reset by peer (104)
> 
> ^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly 
> terminate
> 
> 
> ===============================================================================================================
> 
> As you can see, my trap can signal an abort and the TCP layer can time out, 
> but mpirun just keeps on running...
> 
> Any help greatly appreciated,
> Vlad
> 
> -- 
> David Zhang
> University of California, San Diego


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
