Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Randolph Pullen
It would be excellent if you could address this in 1.4.x or provide an alternative, as it is an important attribute in fault recovery, particularly with a large number of nodes, where the MTBF is significantly lowered; i.e., we can expect node failures from time to time. A bit of background: I am

Re: [OMPI users] Roadrunner blasts past the petaflop mark with Open MPI

2010-06-23 Thread Durga Choudhury
Hi Brad/others, Sorry for waking this very stale thread, but I am researching the prospects of CellBE-based supercomputing and I found this old email a promising lead. My question is: what was the reason for choosing to mix x86-based AMD cores and PPC 970-based Cell? Was the Cell-based computer

Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Ralph Castain
mpirun is not an MPI process, so it makes no difference what your processes are doing with respect to MPI_Abort or any other MPI function call. A quick glance through the code shows that mpirun won't properly terminate under these conditions. It is waiting to hear that all daemons have terminated, and obvious

Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Randolph Pullen
OK. Having confirmed that replacing MPI_Abort with exit() does not work, and having checked that under these conditions the only process left running appears to be mpirun, I think I need to report a bug, i.e.: although the processes themselves can be stopped (by exit if nothing else), mpirun hangs after a

Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Jeff Squyres
Open MPI's fault tolerance support is fairly rudimentary. If you kill any process without calling MPI_Finalize, Open MPI will -- by default -- kill all the others in the job. Various research work is ongoing to improve fault tolerance in Open MPI, but I don't know the state of it in terms of s
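
A minimal way to see that default behavior (a sketch, not code from the thread; run with mpirun -np 2 or more): rank 0 exits without calling MPI_Finalize, and Open MPI's default error handling should then kill the idle survivors.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            fprintf(stderr, "rank 0 exiting without MPI_Finalize\n");
            exit(1);               /* abnormal exit: no MPI_Finalize */
        }

        sleep(60);                 /* survivors idle; the default error  */
        MPI_Finalize();            /* handling should kill them first    */
        return 0;
    }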

Re: [OMPI users] Highly variable performance

2010-06-23 Thread Jed Brown
Following up on this, I have partial resolution. The primary culprit appears to be stale files in a ramdisk non-uniformly distributed across the sockets, thus interacting poorly with NUMA. The slow runs invariably have high numa_miss and numa_foreign counts. I still have trouble making it expl
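
For reference, the counters Jed cites can be read straight from sysfs on Linux; a small sketch (assuming a kernel that exposes /sys/devices/system/node/node*/numastat) that prints them per node:

    #include <stdio.h>

    int main(void)
    {
        char path[64], line[128];
        for (int node = 0; ; node++) {
            snprintf(path, sizeof(path),
                     "/sys/devices/system/node/node%d/numastat", node);
            FILE *f = fopen(path, "r");
            if (!f)
                break;                         /* no more NUMA nodes */
            printf("--- node %d ---\n", node);
            while (fgets(line, sizeof(line), f))
                fputs(line, stdout);           /* numa_hit, numa_miss, numa_foreign, ... */
            fclose(f);
        }
        return 0;
    }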

Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Randolph Pullen
That is effectively what I have done: changing to the immediate send/receive, waiting in a loop a finite number of times for the transfers to complete, and calling MPI_Abort if they do not complete in a set time. It is not clear how I can kill mpirun in a manner consistent with the API. Are
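
A sketch of the pattern being described, with assumed values (MAX_POLLS, the 0.1 s polling interval, and the rank pairing are illustrative, not from the thread): post an immediate send, poll MPI_Test a bounded number of times, and fall back to MPI_Abort if the transfer never completes. Run with at least two ranks.

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAX_POLLS 100                 /* assumed retry budget */

    int main(int argc, char **argv)
    {
        int rank, flag = 0, payload = 42;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Isend(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            for (int i = 0; i < MAX_POLLS && !flag; i++) {
                MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
                usleep(100000);           /* 0.1 s between polls */
            }
            if (!flag) {
                fprintf(stderr, "transfer timed out, aborting\n");
                MPI_Abort(MPI_COMM_WORLD, 1);   /* the call at issue */
            }
        } else if (rank == 1) {
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }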

Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread David Zhang
Since you turned the machine off instead of just killing one of the processes, no signals could be sent to the other processes. Perhaps you could institute some sort of handshaking in your software that periodically checks for the attendance of all machines, and times out if not all are present within so
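
One possible shape for such a handshake (the tag, the 5 s deadline, and the rank-0 collector are assumptions for illustration, not from the thread): every rank sends a heartbeat to rank 0, which aborts the job if any rank stays silent past the deadline.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define HEARTBEAT_TAG 99              /* assumed tag */
    #define DEADLINE_SEC  5.0             /* assumed deadline */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* Collect one heartbeat from every other rank before the deadline. */
            int arrived = 1;              /* rank 0 counts itself */
            int *bufs = malloc((size - 1) * sizeof(int));
            MPI_Request *reqs = malloc((size - 1) * sizeof(MPI_Request));
            for (int r = 1; r < size; r++)
                MPI_Irecv(&bufs[r - 1], 1, MPI_INT, r, HEARTBEAT_TAG,
                          MPI_COMM_WORLD, &reqs[r - 1]);

            double start = MPI_Wtime();
            while (arrived < size && MPI_Wtime() - start < DEADLINE_SEC) {
                int idx, flag;
                MPI_Testany(size - 1, reqs, &idx, &flag, MPI_STATUS_IGNORE);
                if (flag && idx != MPI_UNDEFINED)
                    arrived++;
                usleep(10000);            /* 10 ms between checks */
            }
            if (arrived < size) {
                fprintf(stderr, "only %d of %d ranks answered; aborting\n",
                        arrived, size);
                MPI_Abort(MPI_COMM_WORLD, 1);
            }
            free(reqs);
            free(bufs);
        } else {
            MPI_Send(&rank, 1, MPI_INT, 0, HEARTBEAT_TAG, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Whether that MPI_Abort then brings mpirun down cleanly is, of course, exactly the behavior this thread is questioning.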

[OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Randolph Pullen
I have an MPI program that aggregates data from multiple SQL systems. It all runs fine. To test fault tolerance I switch one of the machines off while it is running. The result is always a hang, i.e. mpirun never completes. To try to avoid this I have replaced the send and receive calls with

[OMPI users] about OpenMPI user-directed fault tolerance

2010-06-23 Thread 王睿
Hi all, Does the present version of Open MPI support user-directed and communicator-driven fault tolerance similar to that implemented in FT-MPI? If so, which functions in the Open MPI API are relevant to it? Thanks very much. Rui -- Rui Wang Institute of Computing Technology, CAS, Beijing, P.R