The problem is that we default to OB1, but that's not the right choice for some platforms (like PathScale/PSM), where there's a huge performance hit for using OB1. So we run into a situation where a user installs Open MPI, starts running, gets horrible performance, bad-mouths Open MPI, and now we're in that game again. Yeah, the sys admin should know what to do, but it doesn't always work that way.

Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:

My fault - I should be more precise in my language. ;-/

#1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems
to me that a simpler solution to what you describe is for the user to
specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal
with the failed-to-initialize problem cleanly by having the proc directly
abort.
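
For example, on the mpirun command line (the application name here is just a placeholder):

    mpirun -np 16 -mca pml cm ./my_app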

Again, sometimes I think we attempt to automate too many things. This seems
like a pretty clear case where you know what you want - the sys admin, if
nobody else, can certainly set that mca param in the default param file!
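
For instance, assuming the standard installation layout, one line in $prefix/etc/openmpi-mca-params.conf makes it the site-wide default:

    # this cluster runs PSM - always use the MTL-based PML
    pml = cm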

Otherwise, it seems to me that you are relying on the modex to detect that your proc failed to init the correct subsystem. I hate to force a modex just for that - if so, then perhaps this could again be a settable option, so we avoid imposing non-scalable behavior on those of us who want scalability?


On 6/23/08 1:21 PM, "Brian W. Barrett" <brbar...@open-mpi.org> wrote:

The selection code was added because high-speed interconnects frequently fail to initialize properly due to random stuff happening (yes, that's a horrible statement, but true).  We ran into a situation with some really flaky machines where most of the processes would choose CM, but a couple would fail to initialize the MTL and therefore choose OB1.  This led to a hang, which is the worst of the worst.

I think #1 is adequate, although it doesn't handle spawn particularly
well.  And spawn is generally used in environments where such network
mismatches are most likely to occur.

Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:

Since my goal is to eliminate the modex completely for managed
installations, could you give me a brief understanding of this eventual PML
selection logic? It would help to hear an example of how and why different
procs could get different answers - and why we would want to allow them to
do so.

Thanks
Ralph



On 6/23/08 11:59 AM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote:

The first approach sounds fair enough to me. We should avoid 2 and 3, as the PML selection mechanism used to be more complex before we reduced it to accommodate a major design bug in the BTL selection process. When the complete PML selection was used, BTLs would be initialized several times, leading to a variety of bugs. Eventually the PML selection should return to its old self, when the BTL bug gets fixed.

Aurelien

On 23 Jun 08, at 12:36, Ralph H Castain wrote:

Yo all

I've been doing further research into the modex and came across something I don't fully understand. It seems we have each process insert into the modex the name of the PML module that it selected. Once the modex has exchanged that info, it then loops across all procs in the job to check their selection, and aborts if any proc picked a different PML module.
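
The check each proc runs today amounts to something like this (an illustrative sketch only - the names and the flattened modex data are simplified stand-ins, not the actual code in ompi/mca/pml/base):

    #include <string.h>

    /* modex_pml[i] holds the PML module name that proc i published */
    static int check_pml_selections(const char *my_pml,
                                    const char *const *modex_pml,
                                    int nprocs)
    {
        int i;
        for (i = 0; i < nprocs; i++) {
            if (0 != strcmp(my_pml, modex_pml[i])) {
                return -1;   /* mismatch: the job aborts here */
            }
        }
        return 0;            /* everyone picked the same PML */
    }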

All well and good...assuming that procs actually -can- choose different PML modules and hence create an "abort" scenario. However, if I look inside the PMLs at their selection logic, I find that a proc can ONLY pick a module other than ob1 if:

1. the user specifies the module to use via -mca pml xyz or by using a module-specific mca param to adjust its priority (example below, after case #2). In this case, since the mca param is propagated, ALL procs have no choice but to pick that same module, so that can't cause us to abort (we will have already returned an error and aborted if the specified module can't run).

2. the pml/cm module detects that an MTL module was selected, and that it is other than "psm". In this case, the CM module will be selected because its default priority is higher than that of OB1.
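
As a concrete example of the priority route from case #1 - note that the exact param name below is an assumption, since components register priorities under the usual <framework>_<component>_priority convention only when they choose to expose one:

    mpirun -np 16 -mca pml_cm_priority 100 ./my_app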

In looking deeper into the MTL selection logic, it appears to me that you either have the required capability or you don't. I can see that in some environments (e.g., rsh across unmanaged collections of machines), it might be possible for someone to launch across a set of machines where some do and some don't have the required support. However, in all other cases, this will be homogeneous across the system.

Given this analysis (and someone more familiar with the PML should feel free to confirm or correct it), it seems to me that this could be streamlined via one or more means:

1. at the most, we could have rank=0 add the PML module name to the modex, and other procs simply check it against their own and return an error if they differ (see the sketch after these options). This accomplishes the identical functionality to what we have today, but with much less info in the modex.

2. we could eliminate this info from the modex altogether by requiring the user to specify the PML module if they want something other than the default OB1. In this case, there can be no confusion over what each proc is to use. The CM module will attempt to init the MTL - if it cannot do so, then the job will return the correct error and tell the user that CM/MTL support is unavailable.

3. we could again eliminate the info by not inserting it into the modex if (a) the default PML module is selected, or (b) the user specified the PML module to be used. In the first case, each proc can simply check to see if they picked the default - if not, then we can insert the info to indicate the difference. Thus, in the "standard" case, no info will be inserted.

In the second case, we will already get an error if the specified PML module could not be used. Hence, the modex check provides no additional info or value.
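
To make option 1 concrete, the per-proc check would shrink to something like this (again an illustrative sketch with stand-in names, not the real modex API) - the modex then carries one module name for the whole job instead of one per proc:

    #include <string.h>

    /* rank0_pml is the single entry that rank 0 published */
    static int check_pml_vs_rank0(const char *my_pml,
                                  const char *rank0_pml)
    {
        if (0 != strcmp(my_pml, rank0_pml)) {
            return -1;   /* differ: return an error instead of hanging */
        }
        return 0;
    }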

I understand the motivation to support automation. However, in this case, the automation actually doesn't seem to buy us very much, and it isn't coming "free". So perhaps some change in how this is done would be in order?

Ralph



_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
