On Jul 16, 2007, at 2:28 PM, Matthew Moskewicz wrote:
MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/
MPI_COMM_CONNECT models. We do support this in Open MPI, but the
restrictions (in terms of ORTE) may be too limiting for you.
perhaps i'll experiment -- any clues as to what the orte
restrictions might be?
The main constraint is that you have to run a "persistent" orted that
will span all your MPI_COMM_WORLDs. We have only lightly tested
this scenario -- Ralph, can you comment more here?
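For anyone who wants to experiment, here is a rough sketch of the accept/connect handshake between two separately started jobs. It assumes the mpi4py bindings rather than the C API (the matching C calls are noted in the comments), and the helper names parse_role, run_server, and run_client are made up for illustration. It does not set up the persistent orted mentioned above; under Open MPI that still has to be in place for the two jobs to find each other.

```python
import sys


def parse_role(argv):
    """Return ("server", None) or ("client", port_name) from argv."""
    if len(argv) == 2 and argv[1] == "server":
        return ("server", None)
    if len(argv) == 3 and argv[1] == "client":
        return ("client", argv[2])
    raise ValueError("usage: prog server | prog client <port_name>")


def run_server():
    # Imported lazily so the file can be loaded without an MPI installation.
    from mpi4py import MPI

    port = MPI.Open_port()                   # MPI_Open_port
    print("port name:", port, flush=True)    # hand this string to the client
    inter = MPI.COMM_SELF.Accept(port)       # MPI_Comm_accept (blocks)
    print("server got:", inter.recv(source=0, tag=0))
    inter.Disconnect()                       # MPI_Comm_disconnect
    MPI.Close_port(port)                     # MPI_Close_port


def run_client(port):
    from mpi4py import MPI

    inter = MPI.COMM_SELF.Connect(port)      # MPI_Comm_connect
    inter.send("hello from the other job", dest=0, tag=0)
    inter.Disconnect()


def main(argv):
    try:
        role, port = parse_role(argv)
    except ValueError as err:
        print(err)
        return 2
    if role == "server":
        run_server()
    else:
        run_client(port)
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Start the server side first, then pass the printed port name to the client as a single quoted argument, since port names can contain characters the shell treats specially.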
- The LSF support also likely doesn't work yet; we started the integration work
and ran into a technical issue that required further discussion with
Platform. They're currently looking into it; we stopped the LSF work
in ORTE until they get back to us.
i see -- i might be trying to work on the 6.x support today. can you
give me any hints on what the problem was in case i run into the same
issue?
Something was wrong with the lsb_launch() function; using it caused a
significant slowdown in the job and it generally wasn't behaving as
expected. Platform issued a fix for me yesterday (i.e., a one-off/
unsupported binary for development purposes) that I haven't gotten to
test yet.
- That being said, MPI_THREAD_MULTIPLE and MPI_COMM_SPAWN *might*
offer a way out here. But I think a) THREAD_MULTIPLE isn't working
yet (other OMPI members are working on this), and b) even when
THREAD_MULTIPLE works, there will be ORTE issues to deal with
(canceling pending resource allocations, etc.). Ralph mentioned that
someone else is working on such things on the TM/PBS/Torque side; I
haven't followed that effort closely.
it seems that MPI_THREAD_MULTIPLE is to be avoided for now, but there
are perhaps other workarounds (using threads in other ways, etc.).
also, i'd love to hear about the existing efforts -- i'm hoping
someone working on them might be reading this ... ;)
Ralph -- can you chime in on the TM/PBS/Torque efforts?
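In the meantime, a program can at least ask for MPI_THREAD_MULTIPLE and check what the library actually grants before deciding how to use threads. Below is a small sketch of that check. It assumes the mpi4py bindings (in C the equivalent is MPI_Init_thread followed by a look at the "provided" value), and the helper names level_satisfies and query_provided_level are invented for this example.

```python
# Thread-support levels in the order the MPI standard ranks them;
# each level permits everything the earlier ones do.
THREAD_LEVELS = ("single", "funneled", "serialized", "multiple")


def level_satisfies(provided, required):
    """True if the provided level is at least as permissive as required."""
    return THREAD_LEVELS.index(provided) >= THREAD_LEVELS.index(required)


def query_provided_level(requested="multiple"):
    """Initialize MPI asking for `requested`; return the level granted."""
    import mpi4py

    mpi4py.rc.thread_level = requested   # must be set before importing MPI
    from mpi4py import MPI               # MPI is initialized on this import

    names = {
        MPI.THREAD_SINGLE: "single",
        MPI.THREAD_FUNNELED: "funneled",
        MPI.THREAD_SERIALIZED: "serialized",
        MPI.THREAD_MULTIPLE: "multiple",
    }
    return names[MPI.Query_thread()]     # MPI_Query_thread


def main():
    provided = query_provided_level("multiple")
    if level_satisfies(provided, "multiple"):
        print("THREAD_MULTIPLE granted; any thread may call MPI")
    else:
        print(f"only '{provided}' granted; concurrent MPI calls "
              "from several threads are not safe")


if __name__ == "__main__":
    try:
        main()
    except ImportError:
        print("mpi4py is not installed; nothing to query")
```

Note that being granted a level only says what the library claims to support. It says nothing about the ORTE-side issues mentioned above.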
--
Jeff Squyres
Cisco Systems