On Apr 22, 2009, at 11:43 PM, Tsung Han Shie wrote:
Unfortunately, after I thoroughly examined the entire cluster, I found a
bad node with a busted hard drive. That's the reason why this job
hung.
Also, when this job is submitted with one bad node in the machinefile,
neither Open MPI nor my program gives me any error messages.
That's why I couldn't find the reason the job hung.
Interesting. Sorry OMPI didn't provide more diagnostics. :-\
Did you get the information that you needed about the OpenFabrics
optimization stuff?
Note that we released OMPI 1.3.2 yesterday, which fixed the
mpi_leave_pinned stuff; also note that the treatment of the
mpi_leave_pinned MCA parameter changed slightly. Please see this FAQ
entry for details:
http://www.open-mpi.org/faq/?category=openfabrics#setting-mpi-leave-pinned-1.3.2
Also, since you did apparently mean v1.1.3, note that that version is
ancient. Much has happened in Open MPI to improve scalability and
performance (and diagnostics!) since the 1.1 series. If it's possible
for you to upgrade, I encourage you to do so.
--
Jeff Squyres
Cisco Systems