It would be excellent if you could address this in 1.4.x or provide an
alternative, as it is an important attribute in fault recovery, particularly
with a large number of nodes, where the MTBF is significantly lowered; i.e. we
can expect node failures from time to time.
A bit of background:
I am
Hi Brad/others
Sorry for waking this very stale thread, but I am researching the
prospects of CellBE based supercomputing and I found this old email a
promising lead.
My question is: what was the reason for choosing to mix x86-based
AMD cores with the PPC 970-based Cell? Was the Cell-based computer
mpirun is not an MPI process, so it makes no difference what your processes are
doing wrt MPI_Abort or any other MPI function call.
A quick glance through the code shows that mpirun won't properly terminate
under these conditions. It is waiting to hear that all daemons have terminated, and
obvious
ok,
Having confirmed that replacing MPI_Abort with exit() does not work, and
having checked that under these conditions the only process left running
appears to be mpirun,
I think I need to report a bug, i.e.:
Although the processes themselves can be stopped (by exit() if nothing else),
mpirun hangs after a
Open MPI's fault tolerance support is fairly rudimentary. If you kill any
process without calling MPI_Finalize, Open MPI will -- by default -- kill all
the others in the job.
Various research work is ongoing to improve fault tolerance in Open MPI, but I
don't know the state of it in terms of s
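
As a concrete illustration of that default behaviour, here is a minimal,
hypothetical C/MPI demo (not code from this thread): rank 0 exits without
calling MPI_Finalize, and the runtime is expected to tear down the rest of
the job.

    /* Hypothetical demo of the default behaviour described above:
     * rank 0 exits without MPI_Finalize, and the runtime is expected
     * to kill the remaining ranks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Simulate a crash: leave without calling MPI_Finalize. */
            fprintf(stderr, "rank 0 exiting abnormally\n");
            exit(1);
        }

        /* The other ranks block here; by default the runtime should
         * terminate them once it notices rank 0 is gone. */
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }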
Following up on this, I have a partial resolution. The primary culprit
appears to be stale files in a ramdisk, non-uniformly distributed across
the sockets, thus interacting poorly with NUMA. The slow runs
invariably have high numa_miss and numa_foreign counts. I still have
trouble making it expl
That is effectively what I have done by changing to the immediate send/receive
calls and waiting in a loop, a finite number of times, for the transfers to
complete, then calling MPI_Abort if they do not complete within a set time
(a sketch of this pattern is below).
It is not clear how I can kill mpirun in a manner consistent with the API.
Are
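
A minimal sketch of the pattern described above, not the poster's actual
code; it assumes exactly two ranks, and MAX_POLLS/POLL_USEC are illustrative
constants. Post non-blocking transfers, poll MPI_Test a bounded number of
times, and call MPI_Abort if the transfer never completes.

    /* Sketch: non-blocking transfer with a bounded wait, then MPI_Abort. */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAX_POLLS  100       /* illustrative: 100 polls...        */
    #define POLL_USEC  100000    /* ...100 ms apart, roughly 10 s     */

    int main(int argc, char **argv)
    {
        int rank, peer, flag = 0, data = 42;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = (rank == 0) ? 1 : 0;    /* assumes exactly two ranks */

        if (rank == 0)
            MPI_Isend(&data, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &req);
        else
            MPI_Irecv(&data, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &req);

        /* Poll a finite number of times instead of blocking in MPI_Wait. */
        for (int i = 0; i < MAX_POLLS && !flag; ++i) {
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
            if (!flag)
                usleep(POLL_USEC);
        }

        if (!flag) {
            fprintf(stderr, "rank %d: transfer timed out, aborting\n", rank);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        MPI_Finalize();
        return 0;
    }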
Since you turned the machine off instead of just killing one of the
processes, no signals could be sent to the other processes. Perhaps you could
institute some sort of handshaking in your software that periodically checks
for the attendance of all machines, and times out if not all are present
within so
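
A rough sketch of such an attendance check; the tag, poll counts, and helper
name are invented for illustration and are not part of any Open MPI API.
Every rank reports to rank 0, and rank 0 aborts the job if not everyone
checks in within a bounded time.

    /* Hypothetical periodic "attendance check": rank 0 waits a bounded
     * time for a heartbeat from every other rank, then aborts on timeout. */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    #define HB_TAG     99        /* illustrative heartbeat tag       */
    #define MAX_POLLS  50        /* ~5 s at 100 ms per poll          */
    #define POLL_USEC  100000

    /* Returns 1 if all ranks checked in within the timeout, 0 otherwise. */
    static int attendance_check(MPI_Comm comm)
    {
        int rank, size, token = 1;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank != 0) {
            MPI_Send(&token, 1, MPI_INT, 0, HB_TAG, comm);
            return 1;   /* workers just report in */
        }

        int present = 1;   /* rank 0 counts itself */
        for (int polls = 0; polls < MAX_POLLS && present < size; ++polls) {
            int flag = 0;
            MPI_Iprobe(MPI_ANY_SOURCE, HB_TAG, comm, &flag, MPI_STATUS_IGNORE);
            if (flag) {
                MPI_Recv(&token, 1, MPI_INT, MPI_ANY_SOURCE, HB_TAG, comm,
                         MPI_STATUS_IGNORE);
                ++present;
            } else {
                usleep(POLL_USEC);
            }
        }
        return present == size;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        if (!attendance_check(MPI_COMM_WORLD)) {
            fprintf(stderr, "attendance check failed, aborting job\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        MPI_Finalize();
        return 0;
    }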
I have an MPI program that aggregates data from multiple SQL systems. It all
runs fine. To test fault tolerance I switch one of the machines off while it
is running. The result is always a hang, i.e. mpirun never completes.
To try to avoid this I have replaced the send and receive calls with
Hi, all
Does the present version of Open MPI support user-directed and
communicator-driven fault tolerance similar to that implemented in FT-MPI?
If so, which functions in the Open MPI API relate to it?
Thanks very much.
Rui
--
Rui Wang
Institute of Computing Technology, CAS, Beijing, P.R. China