On Jul 16, 2007, at 2:28 PM, Matthew Moskewicz wrote:

MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/
MPI_COMM_CONNECT models.  We do support this in Open MPI, but the
restrictions (in terms of ORTE) may not be sufficient for you.

Perhaps I'll experiment -- any clues as to what the ORTE restrictions might be?

The main constraint is that you have to run a "persistent" orted that will span all your MPI_COMM_WORLDs. We have only lightly tested this scenario -- Ralph, can you comment more here?

- It also likely doesn't work yet; we started the integration work
and ran into a technical issue that required further discussion with
Platform.  They're currently looking into it; we stopped the LSF work
in ORTE until they get back to us.

I see -- I might be trying to work on the 6.x support today. Can you
give me any hints on what the problem was in case I run into the same
issue?

Something was wrong with the lsb_launch() function; using it caused a significant slowdown in the job and it generally wasn't behaving as expected. Platform issued a fix for me yesterday (i.e., a one-off/unsupported binary for development purposes) that I haven't gotten to test yet.

- That being said, MPI_THREAD_MULTIPLE and MPI_COMM_SPAWN *might*
offer a way out here.  But I think a) THREAD_MULTIPLE isn't working
yet (other OMPI members are working on this), and b) even when
THREAD_MULTIPLE works, there will be ORTE issues to deal with
(canceling pending resource allocations, etc.).  Ralph mentioned that
someone else is working on such things on the TM/PBS/Torque side; I
haven't followed that effort closely.

It seems that MPI_THREAD_MULTIPLE is to be avoided for now, but there
are perhaps other workarounds (using threads in other ways, etc.).
Also, I'd love to hear about the existing efforts -- I'm hoping
someone working on them might be reading this ... ;)

Ralph -- can you chime in on the TM/PBS/Torque efforts?

--
Jeff Squyres
Cisco Systems
