Jeff,
I agree with your viewpoint, particularly about the "reachability". But...
From the FT viewpoint, sometimes (or in some FT architectures) one wants to
recover an application process on a node different from the original one. In this
case a new modex should be called. It's fine for coordina
Ralph and others,
I ran two tests with the RML/OOB while a PML module (I know it is
strange, but I need it) waits for a message (orte_rml_recv_buffer(...)).
The first one used --enable-progress-threads and the second one
did not.
My test was:
Sent a message from an orted, i.e.
Who wants to win a Wii? (you know, to take home and give *to your
kids* -- yeah, that's it...)
The Open MPI community will be at SC'08 in force this year, featuring:
- Our usual Open MPI State of the Union BOF (Wednesday, 12:15-1:15pm,
room #14 on level 4)
- Drawings to win one of *6* Ninte
Mainly because we know that the RML and OOB are not thread safe? :-)
Seriously, we know that ORTE has thread safety issues, mostly in the
RML/OOB area, which is why we do not allow it to be used with
threading. You are responsible for thread locking above that layer, if
you intend to use th
If you look at the Dec meeting wiki, you will see that we are moving
quickly to a modex-less launch anyway. It won't be the default because
it requires pre-discovery of the cluster's network resources (for
which we will provide a tool or method), but it will help resolve some
of these probl
I'm not 100% sure, but this looks like the changeset that caused all
of IU's trunk MTT runs last night to segfault... yes, all. :-(
Here's the magnitude of the problem:
http://www.open-mpi.org/mtt/index.php?do_redir=883
Note how pretty much everything was passing for 1.4a1r19979,
and everything f
Ralph,
Very good document.
About the MPI layer (in case of a fault), my idea is to give the BML the
ability to handle the BTL errors that occur when a process dies (and
has probably been migrated) by discovering its new location. I think that
this is possible because the HNP requests the restart for th
Ralph Castain wrote:
As has frequently been commented upon at one time or another, the
shared memory backing file can be quite huge. There used to be a
param for controlling this size, but I can't find it in 1.3 - or at
least, the name or method for controlling file size has morphed into