Re: [OMPI devel] PML selection logic

Jeff Squyres Sat, 28 Jun 2008 08:55:50 -0400

Agreed. I have a few ideas in this direction as well (random thoughts that might as well be transcribed somewhere):

- some kind of configure --enable-large-system (whatever) option is a Good Thing

- it would be good if the configure option simply set [MCA parameter?] defaults wherever possible (vs. #if-selecting code). I think one of the biggest lessons learned from Open MPI is that everyone's setup is different -- having the ability to mix and match various run-time options, while not widely used, is absolutely critical in some scenarios. So it might be good if --enable-large-system sets a bunch of default parameters that some sysadmins may still want/need to override.

- decision to run the modex: I haven't seen all of Ralph's work in this area, but I wonder if it's similar to the MPI handle parameter checks: it could be a multi-value MCA parameter, such as: "never", "always", "when-ompi-determines-its-necessary", etc., where the last value can use multiple criteria to know if it's necessary to do a modex (e.g., job size, when spawn occurs, whether the "pml" [or other critical] MCA param[s] were specified, ...etc.).
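
A rough sketch of how such a multi-value setting might be evaluated. This is only an illustration in Python, not Open MPI code: the names ModexPolicy, SMALL_JOB_LIMIT and should_run_modex are made up here, "auto" stands in for "when-ompi-determines-its-necessary", and the way the three criteria are combined is an assumption, since the thread lists the criteria without saying how they would interact.

```python
from enum import Enum


class ModexPolicy(Enum):
    """Hypothetical values for a multi-value modex setting."""
    NEVER = "never"
    ALWAYS = "always"
    AUTO = "auto"  # stand-in for "when-ompi-determines-its-necessary"


# Assumed cut-off below which the modex is considered cheap enough to
# always run. The number is illustrative only.
SMALL_JOB_LIMIT = 512


def should_run_modex(policy, job_size, is_spawn, pml_specified):
    """Decide whether to run the modex under the given policy.

    The AUTO branch is one possible combination of the criteria named
    in the message (job size, spawn, explicit "pml" parameter).
    """
    if policy is ModexPolicy.NEVER:
        return False
    if policy is ModexPolicy.ALWAYS:
        return True
    # AUTO: small jobs pay little for the modex, so keep the check.
    if job_size <= SMALL_JOB_LIMIT:
        return True
    # Dynamic processes may come from differently configured jobs.
    if is_spawn:
        return True
    # Large job: skip only when the PML was explicitly specified.
    return not pml_specified
```

With these assumptions, a large job that names its PML explicitly skips the modex, while spawned jobs and small jobs keep the check.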



On Jun 26, 2008, at 9:26 AM, Ralph H Castain wrote:

Just to complete this thread...
Brian raised a very good point, so we identified it on the weekly telecon as a subject that really should be discussed at next week's technical meeting. I think we can find a reasonable answer, but there are several ways it can be done. So rather than doing our usual piecemeal approach to the solution, it makes sense to begin talking about a more holistic design for accommodating both needs.

Thanks Brian for pointing out the bigger picture.
Ralph



On 6/24/08 8:22 AM, "Brian W. Barrett" <[email protected]> wrote:
yeah, that could be a problem, but it's such a minority case and we've got to draw the line somewhere.

Of course, it seems like this is a never ending battle between two opposing forces... The desire to do the "right thing" all the time at small and medium scale and the desire to scale out to the "big thing". It seems like in the quest to kill off the modex, we've run into these pretty often.

The modex doesn't hurt us at small scale (indeed, we're probably ok with the routed communication pattern up to 512 nodes or so if we don't do anything stupid, maybe further). Is it time to admit defeat in this argument and have a configure option that turns off the modex (at the cost of some of these correctness checks) for the large machines, but keeps things simple for the common case? I'm sure there are other things where this will come up, so perhaps a --enable-large-scale? Maybe it's a dumb idea, but it seems like we've made a lot of compromises lately around this, where no one ends up really happy with the solution :/.

Brian


On Tue, 24 Jun 2008, George Bosilca wrote:
Brian hinted at a possible bug in one of his replies. How does this work in the case of dynamic processes? We can envision several scenarios, but let's take a simple one: 2 jobs that get connected with connect/accept. One might publish the PML name (simply because the -mca argument was on) and one might not?
george.

On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:
Also sounds good to me.
Note that the most difficult part of the forward-looking plan is that we usually can't tell the difference between "something failed to initialize" and "you don't have support for feature X".

I like the general philosophy of: running out of the box always works just fine, but if you/the sysadmin is smart, you can get performance improvements.


On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:
I concur
- galen

On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:
That sounds like a reasonable plan to me.

Brian

On Mon, 23 Jun 2008, Ralph H Castain wrote:
Okay, so let's explore an alternative that preserves the support you are seeking for the "ignorant user", but doesn't penalize everyone else. What we could do is simply set things up so that:

1. if -mca pml xyz is provided, then no modex data is added
2. if it is not provided, then only rank=0 inserts the data. All other procs simply check their own selection against the one given by rank=0

Now, if a knowledgeable user or sys admin specifies what to use for their system, we won't penalize their startup time. A user who doesn't know what to do gets to run, albeit less scalably on startup.

Looking forward from there, we can look to a day where failing to initialize something that exists on the system could be detected in some other fashion, letting the local proc abort since it would know that other procs that detected similar capabilities may well have selected that PML. For now, though, this would solve the problem.

Make sense?
Ralph
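
The two rules in the message above can be sketched as follows. This is a toy Python model, not the Open MPI implementation: the function names are hypothetical, the list `selections` stands in for the PML name each rank chose locally, and a plain dict stands in for the modex.

```python
def modex_entries(selections, user_specified_pml):
    """Return the {rank: pml_name} entries that would go into the modex.

    Rule 1: if the user specified the PML, nothing is published.
    Rule 2: otherwise only rank 0 publishes its selection.
    """
    if user_specified_pml:
        return {}
    return {0: selections[0]}


def check_against_rank0(selections, user_specified_pml):
    """Return the ranks whose local selection differs from rank 0's.

    An empty list means no proc would need to abort.
    """
    published = modex_entries(selections, user_specified_pml)
    if not published:
        # Nothing was published, so there is nothing to compare against.
        return []
    reference = published[0]
    return [rank for rank, pml in enumerate(selections) if pml != reference]
```

In this model the modex carries at most one PML entry per job, instead of one per process.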
On 6/23/08 1:31 PM, "Brian W. Barrett" <[email protected]> wrote:
The problem is that we default to OB1, but that's not the right choice for some platforms (like Pathscale / PSM), where there's a huge performance hit for using OB1. So we run into a situation where user installs Open MPI, starts running, gets horrible performance, bad mouths Open MPI, and now we're in that game again. Yeah, the sys admin should know what to do, but it doesn't always work that way.

Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:
My fault - I should be more precise in my language. ;-/

#1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems to me that a simpler solution to what you describe is for the user to specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal with the failed-to-initialize problem cleanly by having the proc directly abort.

Again, sometimes I think we attempt to automate too many things. This seems like a pretty clear case where you know what you want - the sys admin, if nobody else, can certainly set that mca param in the default param file!

Otherwise, it seems to me that you are relying on the modex to detect that your proc failed to init the correct subsystem. I hate to force a modex just for that - if so, then perhaps this could again be a settable option to avoid requiring non-scalable behavior for those of us who want scalability?
On 6/23/08 1:21 PM, "Brian W. Barrett" <brbarret@open-mpi.org> wrote:
The selection code was added because frequently high speed interconnects fail to initialize properly due to random stuff happening (yes, that's a horrible statement, but true). We ran into a situation with some really flaky machines where most of the processes would choose CM, but a couple would fail to initialize the MTL and therefore choose OB1. This led to a hang situation, which is the worst of the worst.

I think #1 is adequate, although it doesn't handle spawn particularly well. And spawn is generally used in environments where such network mismatches are most likely to occur.

Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:
Since my goal is to eliminate the modex completely for managed installations, could you give me a brief understanding of this eventual PML selection logic? It would help to hear an example of how and why different procs could get different answers - and why we would want to allow them to do so.

Thanks
Ralph
On 6/23/08 11:59 AM, "Aurélien Bouteiller" <[email protected]> wrote:
The first approach sounds fair enough to me. We should avoid 2 and 3 as the pml selection mechanism used to be more complex before we reduced it to accommodate a major design bug in the BTL selection process. When using the complete PML selection, BTL would be initialized several times, leading to a variety of bugs. Eventually the PML selection should return to its old self, when the BTL bug gets fixed.

Aurelien

On 23 June 08, at 12:36, Ralph H Castain wrote:
Yo all
I've been doing further research into the modex and came across something I don't fully understand. It seems we have each process insert into the modex the name of the PML module that it selected. Once the modex has exchanged that info, it then loops across all procs in the job to check their selection, and aborts if any proc picked a different PML module.
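
For readers unfamiliar with the check being described, here is a small Python model of it. It is a sketch under assumptions, not the actual Open MPI code: PmlMismatch, build_modex and check_selection are hypothetical names, and a dict keyed by rank stands in for the exchanged modex data.

```python
class PmlMismatch(RuntimeError):
    """Raised where the real code would abort the job."""


def build_modex(selections):
    """Every rank contributes the name of the PML module it selected."""
    return {rank: pml for rank, pml in enumerate(selections)}


def check_selection(my_rank, modex):
    """Loop across all procs in the job and compare against our own pick."""
    mine = modex[my_rank]
    for rank, pml in modex.items():
        if pml != mine:
            raise PmlMismatch(
                f"rank {my_rank} selected {mine!r} but rank {rank} selected {pml!r}"
            )
```

The point relevant to the discussion is that the modex holds one entry per process, and every process walks all of them.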
All well and good...assuming that procs actually -can- choose different PML modules and hence create an "abort" scenario. However, if I look inside the PMLs at their selection logic, I find that a proc can ONLY pick a module other than ob1 if:

1. the user specifies the module to use via -mca pml xyz or by using a module specific mca param to adjust its priority. In this case, since the mca param is propagated, ALL procs have no choice but to pick that same module, so that can't cause us to abort (we will have already returned an error and aborted if the specified module can't run).

2. the pml/cm module detects that an MTL module was selected, and that it is other than "psm". In this case, the CM module will be selected because its default priority is higher than that of OB1.

In looking deeper into the MTL selection logic, it appears to me that you either have the required capability or you don't. I can see that in some environments (e.g., rsh across unmanaged collections of machines), it might be possible for someone to launch across a set of machines where some do and some don't have the required support. However, in all other cases, this will be homogeneous across the system.
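
The selection behavior analyzed above can be modeled roughly like this. It is a Python sketch, not the real selection code: select_pml is a hypothetical name, "some_mtl" in the usage is a placeholder MTL name, and the priority numbers are invented. Only their ordering (cm above ob1) comes from the message.

```python
# Placeholder priorities: only the ordering (cm above ob1) is taken
# from the thread, the values themselves are made up.
DEFAULT_PRIORITY = {"ob1": 1, "cm": 2}


def select_pml(selected_mtl, forced_pml=None):
    """Pick a PML name for one proc.

    selected_mtl: name of the MTL that initialized locally, or None.
    forced_pml:   value of an explicit "-mca pml xyz", or None.
    """
    if forced_pml is not None:
        # Case 1: the user's choice is propagated to every proc.
        if forced_pml == "cm" and selected_mtl is None:
            raise RuntimeError("pml cm requested but no MTL could be initialized")
        return forced_pml
    candidates = ["ob1"]
    # Case 2: cm only competes when an MTL other than "psm" was selected.
    if selected_mtl is not None and selected_mtl != "psm":
        candidates.append("cm")
    return max(candidates, key=DEFAULT_PRIORITY.__getitem__)
```

Under this model, two procs disagree only when no PML is forced and the MTL initializes on one of them but not the other.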
Given this analysis (and someone more familiar with the PML should feel free to confirm or correct it), it seems to me that this could be streamlined via one or more means:

1. at the most, we could have rank=0 add the PML module name to the modex, and other procs simply check it against their own and return an error if they differ. This accomplishes the identical functionality to what we have today, but with much less info in the modex.

2. we could eliminate this info from the modex altogether by requiring the user to specify the PML module if they want something other than the default OB1. In this case, there can be no confusion over what each proc is to use. The CM module will attempt to init the MTL - if it cannot do so, then the job will return the correct error and tell the user that CM/MTL support is unavailable.

3. we could again eliminate the info by not inserting it into the modex if (a) the default PML module is selected, or (b) the user specified the PML module to be used. In the first case, each proc can simply check to see if they picked the default - if not, then we can insert the info to indicate the difference. Thus, in the "standard" case, no info will be inserted. In the second case, we will already get an error if the specified PML module could not be used. Hence, the modex check provides no additional info or value.
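
Option 3 in the list above boils down to a per-proc publish rule, which might look like this. Again a Python sketch with hypothetical names (should_publish, option3_modex), and with a dict standing in for the modex.

```python
DEFAULT_PML = "ob1"


def should_publish(selected_pml, user_specified_pml):
    """Option 3: publish only a non-default, automatically chosen PML."""
    if user_specified_pml:
        # (b) a forced PML either works everywhere or has already errored out.
        return False
    # (a) the default selection is never published.
    return selected_pml != DEFAULT_PML


def option3_modex(selections, user_specified_pml):
    """Return the {rank: pml_name} entries that would reach the modex."""
    return {
        rank: pml
        for rank, pml in enumerate(selections)
        if should_publish(pml, user_specified_pml)
    }
```

Note that under this rule a job where every proc automatically picks a non-default PML still publishes one entry per proc, so the saving applies to the default and user-specified cases.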
I understand the motivation to support automation. However, in this case, the automation actually doesn't seem to buy us very much, and it isn't coming "free". So perhaps some change in how this is done would be in order?

Ralph



_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Jeff Squyres
Cisco Systems





--
Jeff Squyres
Cisco Systems
