Re: [OMPI devel] PML selection logic

2008-06-29 Thread Lenny Verkhovsky
We can also make a few different param files for typical setups (large cluster
/ minimum latency / maximum bandwidth, etc.).
The desired param file can be chosen by a configure flag and placed in
$prefix/etc/openmpi-mca-params.conf
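
A rough sketch of the idea, in Python for brevity. The profile names and the
values assigned to them are hypothetical. Only the "name = value" file syntax
and the pml component names (ob1, cm) are taken from Open MPI and this thread.

```python
# Hypothetical setup profiles mapped to default MCA parameters.
# The profile names are invented for illustration; "pml = cm" for PSM-style
# hardware and an explicit "pml = ob1" follow the discussion in this thread.
PROFILES = {
    "default": {},
    "large-cluster": {"pml": "ob1"},
    "psm": {"pml": "cm"},
}


def render_param_file(profile):
    """Return the text of an openmpi-mca-params.conf for one profile."""
    params = PROFILES[profile]
    lines = ["# Open MPI MCA parameters for profile: " + profile]
    lines += [f"{name} = {value}" for name, value in sorted(params.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_param_file("psm"), end="")
```

A configure flag would then only have to pick which rendered file gets
installed, and a sysadmin could still edit it afterwards.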

On Sat, Jun 28, 2008 at 3:55 PM, Jeff Squyres  wrote:

> Agreed.  I have a few ideas in this direction as well (random thoughts that
> might as well be transcribed somewhere):
>
> - some kind of configure --enable-large-system (whatever) option is a Good
> Thing
>
> - it would be good if the configure option simply set [MCA parameter?]
> defaults wherever possible (vs. #if-selecting code).  I think one of the
> biggest lessons learned from Open MPI is that everyone's setup is different
> -- having the ability to mix and match various run-time options, while not
> widely used, is absolutely critical in some scenarios.  So it might be good
> if --enable-large-system sets a bunch of default parameters that some
> sysadmins may still want/need to override.
>
> - decision to run the modex: I haven't seen all of Ralph's work in this
> area, but I wonder if it's similar to the MPI handle parameter checks: it
> could be a multi-value MCA parameter, such as: "never", "always",
> "when-ompi-determines-its-necessary", etc., where the last value can use
> multiple criteria to know if it's necessary to do a modex (e.g., job size,
> when spawn occurs, whether the "pml" [or other critical] MCA param[s] were
> specified, ...etc.).
>
>
>
> On Jun 26, 2008, at 9:26 AM, Ralph H Castain wrote:
>
>> Just to complete this thread...
>>
>> Brian raised a very good point, so we identified it on the weekly telecon as
>> a subject that really should be discussed at next week's technical meeting.
>> I think we can find a reasonable answer, but there are several ways it can
>> be done. So rather than doing our usual piecemeal approach to the solution,
>> it makes sense to begin talking about a more holistic design for
>> accommodating both needs.
>>
>> Thanks Brian for pointing out the bigger picture.
>> Ralph
>>
>>
>>
>> On 6/24/08 8:22 AM, "Brian W. Barrett"  wrote:
>>
>>> yeah, that could be a problem, but it's such a minority case and we've got
>>> to draw the line somewhere.
>>>
>>> Of course, it seems like this is a never ending battle between two
>>> opposing forces...  The desire to do the "right thing" all the time at
>>> small and medium scale and the desire to scale out to the "big thing".
>>> It seems like in the quest to kill off the modex, we've run into these
>>> pretty often.
>>>
>>> The modex doesn't hurt us at small scale (indeed, we're probably ok with
>>> the routed communication pattern up to 512 nodes or so if we don't do
>>> anything stupid, maybe further).  Is it time to admit defeat in this
>>> argument and have a configure option that turns off the modex (at the cost
>>> of some of these correctness checks) for the large machines, but keeps
>>> things simple for the common case?  I'm sure there are other things where
>>> this will come up, so perhaps a --enable-large-scale?  Maybe it's a dumb
>>> idea, but it seems like we've made a lot of compromises lately around
>>> this, where no one ends up really happy with the solution :/.
>>>
>>> Brian
>>>
>>>
>>> On Tue, 24 Jun 2008, George Bosilca wrote:
>>>
>>>> Brian hinted at a possible bug in one of his replies. How does this work in
>>>> the case of dynamic processes? We can envision several scenarios, but let's
>>>> take a simple one: 2 jobs that get connected with connect/accept. One might
>>>> publish the PML name (simply because the -mca argument was on) and one
>>>> might not?
>>>>
>>>> george.
>>>>
>>>> On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:
>>>>
>>>>> Also sounds good to me.
>>>>>
>>>>> Note that the most difficult part of the forward-looking plan is that we
>>>>> usually can't tell the difference between "something failed to initialize"
>>>>> and "you don't have support for feature X".
>>>>>
>>>>> I like the general philosophy of: running out of the box always works just
>>>>> fine, but if you/the sysadmin is smart, you can get performance
>>>>> improvements.
>>>>>
>>>>>
>>>>> On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:
>>>>>
>>>>>> I concur
>>>>>> - galen
>>>>>>
>>>>>> On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:
>>>>>>
>>>>>>> That sounds like a reasonable plan to me.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>
>>>>>>>> Okay, so let's explore an alternative that preserves the support you are
>>>>>>>> seeking for the "ignorant user", but doesn't penalize everyone else.
>>>>>>>> What we could do is simply set things up so that:
>>>>>>>>
>>>>>>>> 1. if -mca pml xyz is provided, then no modex data is added
>>>>>>>>
>>>>>>>> 2. if it is not provided, then only rank=0 inserts the data. All other
>>>>>>>> procs simply check their own selection against the one given by rank=0
>>>>>>>>
>>>>>>>> Now, if a knowledg

Re: [OMPI devel] PML selection logic

2008-06-28 Thread Jeff Squyres
Agreed.  I have a few ideas in this direction as well (random thoughts  
that might as well be transcribed somewhere):


- some kind of configure --enable-large-system (whatever) option is a  
Good Thing


- it would be good if the configure option simply set [MCA parameter?]  
defaults wherever possible (vs. #if-selecting code).  I think one of  
the biggest lessons learned from Open MPI is that everyone's setup is  
different -- having the ability to mix and match various run-time  
options, while not widely used, is absolutely critical in some  
scenarios.  So it might be good if --enable-large-system sets a bunch  
of default parameters that some sysadmins may still want/need to  
override.


- decision to run the modex: I haven't seen all of Ralph's work in  
this area, but I wonder if it's similar to the MPI handle parameter  
checks: it could be a multi-value MCA parameter, such as: "never",  
"always", "when-ompi-determines-its-necessary", etc., where the last  
value can use multiple criteria to know if it's necessary to do a  
modex (e.g., job size, when spawn occurs, whether the "pml" [or other  
critical] MCA param[s] were specified, ...etc.).
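
To make the multi-value idea concrete, here is a small Python sketch. It is
not existing Open MPI code. The enum names, the "auto" rules, and the 512-node
threshold (borrowed from Brian's remark about small scale) are all
assumptions for illustration.

```python
from enum import Enum


class ModexPolicy(Enum):
    NEVER = "never"
    ALWAYS = "always"
    # Stands in for "when-ompi-determines-its-necessary".
    AUTO = "auto"


def should_run_modex(policy, num_nodes, is_spawn, pml_specified,
                     small_job_limit=512):
    """Decide whether to run the modex under a hypothetical policy param.

    AUTO rules (assumed, not from any real implementation):
      - always run it when a spawn is involved;
      - run it for small jobs, where it is cheap;
      - otherwise run it only if the pml was not specified explicitly.
    """
    if policy is ModexPolicy.NEVER:
        return False
    if policy is ModexPolicy.ALWAYS:
        return True
    if is_spawn:
        return True
    if num_nodes <= small_job_limit:
        return True
    return not pml_specified


if __name__ == "__main__":
    print(should_run_modex(ModexPolicy.AUTO, 4096, False, True))
```

The point of the sketch is only that the criteria listed above can live
behind one run-time parameter that a sysadmin can still override.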



On Jun 26, 2008, at 9:26 AM, Ralph H Castain wrote:


Just to complete this thread...

Brian raised a very good point, so we identified it on the weekly telecon as
a subject that really should be discussed at next week's technical meeting.
I think we can find a reasonable answer, but there are several ways it can
be done. So rather than doing our usual piecemeal approach to the solution,
it makes sense to begin talking about a more holistic design for
accommodating both needs.

Thanks Brian for pointing out the bigger picture.
Ralph



On 6/24/08 8:22 AM, "Brian W. Barrett"  wrote:

yeah, that could be a problem, but it's such a minority case and we've got
to draw the line somewhere.

Of course, it seems like this is a never ending battle between two
opposing forces...  The desire to do the "right thing" all the time at
small and medium scale and the desire to scale out to the "big thing".
It seems like in the quest to kill off the modex, we've run into these
pretty often.

The modex doesn't hurt us at small scale (indeed, we're probably ok with
the routed communication pattern up to 512 nodes or so if we don't do
anything stupid, maybe further).  Is it time to admit defeat in this
argument and have a configure option that turns off the modex (at the cost
of some of these correctness checks) for the large machines, but keeps
things simple for the common case?  I'm sure there are other things where
this will come up, so perhaps a --enable-large-scale?  Maybe it's a dumb
idea, but it seems like we've made a lot of compromises lately around
this, where no one ends up really happy with the solution :/.

Brian


On Tue, 24 Jun 2008, George Bosilca wrote:

Brian hinted at a possible bug in one of his replies. How does this work in the
case of dynamic processes? We can envision several scenarios, but let's take a
simple one: 2 jobs that get connected with connect/accept. One might publish the
PML name (simply because the -mca argument was on) and one might not?


george.

On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:


Also sounds good to me.

Note that the most difficult part of the forward-looking plan is that we
usually can't tell the difference between "something failed to initialize"
and "you don't have support for feature X".

I like the general philosophy of: running out of the box always works just
fine, but if you/the sysadmin is smart, you can get performance
improvements.


On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:


I concur
- galen

On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:


That sounds like a reasonable plan to me.

Brian

On Mon, 23 Jun 2008, Ralph H Castain wrote:

Okay, so let's explore an alternative that preserves the support you are
seeking for the "ignorant user", but doesn't penalize everyone else.
What we could do is simply set things up so that:

1. if -mca pml xyz is provided, then no modex data is added

2. if it is not provided, then only rank=0 inserts the data. All other
procs simply check their own selection against the one given by rank=0

Now, if a knowledgeable user or sys admin specifies what to use for their
system, we won't penalize their startup time. A user who doesn't know what
to do gets to run, albeit less scalably on startup.

Looking forward from there, we can look to a day where failing to
initialize something that exists on the system could be detected in some
other fashion, letting the local proc abort since it would know that other
procs that detected similar capabilities may well have selected that PML.
For now, though, this would solve the problem.

Make sense?
Ralph
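
The two rules in the proposal above could be sketched roughly as follows
(Python, illustration only). The function names and the dict used to stand
in for published modex data are hypothetical and do not correspond to any
Open MPI API.

```python
def pml_modex_entries(rank, selected_pml, pml_on_cmdline):
    """Return the (hypothetical) modex data this proc would publish.

    Rule 1: if the pml was given explicitly, nobody publishes anything.
    Rule 2: otherwise only rank 0 publishes its selection.
    """
    if pml_on_cmdline:
        return {}
    if rank == 0:
        return {"pml": selected_pml}
    return {}


def selection_matches(my_pml, published_by_rank0):
    """Check the local selection against what rank 0 published.

    With nothing published there is nothing to compare against,
    so the local selection is accepted as-is.
    """
    expected = published_by_rank0.get("pml")
    if expected is None:
        return True
    return my_pml == expected


if __name__ == "__main__":
    data = pml_modex_entries(0, "cm", pml_on_cmdline=False)
    # A proc that fell back to ob1 would see the mismatch and could abort.
    print(selection_matches("ob1", data))
```

Note that when the pml is specified explicitly, a proc that silently fell
back to another PML is not caught by this check, which is the
failed-to-initialize case discussed elsewhere in the thread.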



On 6/23/08 1:31 PM, "Brian W. Barrett" wrote:

The problem is that we default to OB1, but that's not the right choice for
some platforms (like Path

Re: [OMPI devel] PML selection logic

2008-06-26 Thread Ralph H Castain
Just to complete this thread...

Brian raised a very good point, so we identified it on the weekly telecon as
a subject that really should be discussed at next week's technical meeting.
I think we can find a reasonable answer, but there are several ways it can
be done. So rather than doing our usual piecemeal approach to the solution,
it makes sense to begin talking about a more holistic design for
accommodating both needs.

Thanks Brian for pointing out the bigger picture.
Ralph



On 6/24/08 8:22 AM, "Brian W. Barrett"  wrote:

> yeah, that could be a problem, but it's such a minority case and we've got
> to draw the line somewhere.
> 
> Of course, it seems like this is a never ending battle between two
> opposing forces...  The desire to do the "right thing" all the time at
> small and medium scale and the desire to scale out to the "big thing".
> It seems like in the quest to kill off the modex, we've run into these
> pretty often.
> 
> The modex doesn't hurt us at small scale (indeed, we're probably ok with
> the routed communication pattern up to 512 nodes or so if we don't do
> anything stupid, maybe further).  Is it time to admit defeat in this
> argument and have a configure option that turns off the modex (at the cost
> of some of these correctness checks) for the large machines, but keeps
> things simple for the common case?  I'm sure there are other things where
> this will come up, so perhaps a --enable-large-scale?  Maybe it's a dumb
> idea, but it seems like we've made a lot of compromises lately around
> this, where no one ends up really happy with the solution :/.
> 
> Brian
> 
> 
> On Tue, 24 Jun 2008, George Bosilca wrote:
> 
>> Brian hinted at a possible bug in one of his replies. How does this work in
>> the case of dynamic processes? We can envision several scenarios, but let's
>> take a simple one: 2 jobs that get connected with connect/accept. One might
>> publish the PML name (simply because the -mca argument was on) and one
>> might not?
>> 
>> george.
>> 
>> On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:
>> 
>>> Also sounds good to me.
>>> 
>>> Note that the most difficult part of the forward-looking plan is that we
>>> usually can't tell the difference between "something failed to initialize"
>>> and "you don't have support for feature X".
>>> 
>>> I like the general philosophy of: running out of the box always works just
>>> fine, but if you/the sysadmin is smart, you can get performance
>>> improvements.
>>> 
>>> 
>>> On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:
>>> 
>>>> I concur
>>>> - galen
>>>>
>>>> On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:
>>>>
>>>>> That sounds like a reasonable plan to me.
>>>>>
>>>>> Brian
>>>>>
>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>
>>>>>> Okay, so let's explore an alternative that preserves the support you are
>>>>>> seeking for the "ignorant user", but doesn't penalize everyone else.
>>>>>> What we could do is simply set things up so that:
>>>>>>
>>>>>> 1. if -mca pml xyz is provided, then no modex data is added
>>>>>>
>>>>>> 2. if it is not provided, then only rank=0 inserts the data. All other
>>>>>> procs simply check their own selection against the one given by rank=0
>>>>>>
>>>>>> Now, if a knowledgeable user or sys admin specifies what to use for their
>>>>>> system, we won't penalize their startup time. A user who doesn't know what
>>>>>> to do gets to run, albeit less scalably on startup.
>>>>>>
>>>>>> Looking forward from there, we can look to a day where failing to
>>>>>> initialize something that exists on the system could be detected in some
>>>>>> other fashion, letting the local proc abort since it would know that other
>>>>>> procs that detected similar capabilities may well have selected that PML.
>>>>>> For now, though, this would solve the problem.
>>>>>>
>>>>>> Make sense?
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 6/23/08 1:31 PM, "Brian W. Barrett" wrote:
>>>>>>
>>>>>>> The problem is that we default to OB1, but that's not the right choice for
>>>>>>> some platforms (like Pathscale / PSM), where there's a huge performance
>>>>>>> hit for using OB1.  So we run into a situation where user installs Open
>>>>>>> MPI, starts running, gets horrible performance, bad mouths Open MPI, and
>>>>>>> now we're in that game again.  Yeah, the sys admin should know what to do,
>>>>>>> but it doesn't always work that way.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>
>>>>>>>> My fault - I should be more precise in my language. ;-/
>>>>>>>>
>>>>>>>> #1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems
>>>>>>>> to me that a simpler solution to what you describe is for the user to
>>>>>>>> specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal
>>>>>>>> with the failed-to-initialize problem cleanly by having the proc
>>>>>>>> di

Re: [OMPI devel] PML selection logic

2008-06-24 Thread Brian W. Barrett
yeah, that could be a problem, but it's such a minority case and we've got 
to draw the line somewhere.


Of course, it seems like this is a never ending battle between two 
opposing forces...  The desire to do the "right thing" all the time at 
small and medium scale and the desire to scale out to the "big thing". 
It seems like in the quest to kill off the modex, we've run into these 
pretty often.


The modex doesn't hurt us at small scale (indeed, we're probably ok with 
the routed communication pattern up to 512 nodes or so if we don't do 
anything stupid, maybe further).  Is it time to admit defeat in this 
argument and have a configure option that turns off the modex (at the cost 
of some of these correctness checks) for the large machines, but keeps 
things simple for the common case?  I'm sure there are other things where 
this will come up, so perhaps a --enable-large-scale?  Maybe it's a dumb 
idea, but it seems like we've made a lot of compromises lately around 
this, where no one ends up really happy with the solution :/.


Brian


On Tue, 24 Jun 2008, George Bosilca wrote:

Brian hinted at a possible bug in one of his replies. How does this work in the
case of dynamic processes? We can envision several scenarios, but let's take a
simple one: 2 jobs that get connected with connect/accept. One might publish the
PML name (simply because the -mca argument was on) and one might not?


george.

On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:


Also sounds good to me.

Note that the most difficult part of the forward-looking plan is that we 
usually can't tell the difference between "something failed to initialize" 
and "you don't have support for feature X".


I like the general philosophy of: running out of the box always works just 
fine, but if you/the sysadmin is smart, you can get performance 
improvements.



On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:


I concur
- galen

On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:


That sounds like a reasonable plan to me.

Brian

On Mon, 23 Jun 2008, Ralph H Castain wrote:


Okay, so let's explore an alternative that preserves the support you are
seeking for the "ignorant user", but doesn't penalize everyone else.
What we could do is simply set things up so that:

1. if -mca pml xyz is provided, then no modex data is added

2. if it is not provided, then only rank=0 inserts the data. All other
procs simply check their own selection against the one given by rank=0

Now, if a knowledgeable user or sys admin specifies what to use for their
system, we won't penalize their startup time. A user who doesn't know what
to do gets to run, albeit less scalably on startup.

Looking forward from there, we can look to a day where failing to
initialize something that exists on the system could be detected in some
other fashion, letting the local proc abort since it would know that other
procs that detected similar capabilities may well have selected that PML.
For now, though, this would solve the problem.

Make sense?
Ralph



On 6/23/08 1:31 PM, "Brian W. Barrett"  wrote:

The problem is that we default to OB1, but that's not the right choice for
some platforms (like Pathscale / PSM), where there's a huge performance
hit for using OB1.  So we run into a situation where user installs Open
MPI, starts running, gets horrible performance, bad mouths Open MPI, and
now we're in that game again.  Yeah, the sys admin should know what to do,
but it doesn't always work that way.

Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:


My fault - I should be more precise in my language. ;-/

#1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems
to me that a simpler solution to what you describe is for the user to
specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal
with the failed-to-initialize problem cleanly by having the proc directly
abort.

Again, sometimes I think we attempt to automate too many things. This seems
like a pretty clear case where you know what you want - the sys admin, if
nobody else, can certainly set that mca param in the default param file!

Otherwise, it seems to me that you are relying on the modex to detect that
your proc failed to init the correct subsystem. I hate to force a modex just
for that - if so, then perhaps this could again be a settable option to
avoid requiring non-scalable behavior for those of us who want scalability?



On 6/23/08 1:21 PM, "Brian W. Barrett"  wrote:

The selection code was added because frequently high speed interconnects
fail to initialize properly due to random stuff happening (yes, that's a
horrible statement, but true).  We ran into a situation with some really
flaky machines where most of the processes would choose CM, but a couple
would fail to initialize the MTL and therefore choose OB1.  This led to a
hang situation, which is the worst of the worst.

I think #1 is adequate, although it doesn't handle spawn particularly
well.  An

Re: [OMPI devel] PML selection logic

2008-06-24 Thread Ralph H Castain
It is a good point. What I have prototyped would still handle it -
basically, it checks to see if any data has been published and does a modex
if so.

So if one side does send modex data, the other side will faithfully decode
it. I think the bigger issue will be if both sides don't, and they don't
match.

So perhaps for dynamic processes we just have to force the modex, just like
we force other things to happen that don't occur during a normal startup.
The prototype will handle that just fine, but I wasn't planning on
committing it into 1.3 just so we could test all these use-cases.
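
As a sketch of the cases described above (Python, illustration only, not the
prototype itself): each side of a connect/accept has a selected PML and may
or may not have published modex data. The function name, its parameters, and
the outcome strings are hypothetical. It also assumes that one side's
published data is enough for the other side to notice a mismatch.

```python
def connect_outcome(a_pml, a_published, b_pml, b_published,
                    force_modex=False):
    """Classify a connect/accept between two jobs.

    A modex happens if either side published data, or if it is forced
    (the suggestion above for dynamic processes). Without a modex,
    a PML mismatch goes unnoticed.
    """
    modex_happens = force_modex or a_published or b_published
    if a_pml == b_pml:
        return "ok"
    return "mismatch-detected" if modex_happens else "mismatch-undetected"


if __name__ == "__main__":
    # Neither side published and the PMLs differ: the problem case.
    print(connect_outcome("cm", False, "ob1", False))
    # Forcing the modex for dynamic processes makes the mismatch visible.
    print(connect_outcome("cm", False, "ob1", False, force_modex=True))
```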


On 6/24/08 8:16 AM, "George Bosilca"  wrote:

> Brian hinted at a possible bug in one of his replies. How does this work
> in the case of dynamic processes? We can envision several scenarios,
> but let's take a simple one: 2 jobs that get connected with connect/accept.
> One might publish the PML name (simply because the -mca argument was
> on) and one might not?
>
> george.
> 
> On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:
> 
>> Also sounds good to me.
>> 
>> Note that the most difficult part of the forward-looking plan is
>> that we usually can't tell the difference between "something failed
>> to initialize" and "you don't have support for feature X".
>> 
>> I like the general philosophy of: running out of the box always
>> works just fine, but if you/the sysadmin is smart, you can get
>> performance improvements.
>> 
>> 
>> On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:
>> 
>>> I concur
>>> - galen
>>> 
>>> On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:
>>> 
>>>> That sounds like a reasonable plan to me.
>>>>
>>>> Brian
>>>>
>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>
>>>>> Okay, so let's explore an alternative that preserves the support you are
>>>>> seeking for the "ignorant user", but doesn't penalize everyone else.
>>>>> What we could do is simply set things up so that:
>>>>>
>>>>> 1. if -mca pml xyz is provided, then no modex data is added
>>>>>
>>>>> 2. if it is not provided, then only rank=0 inserts the data. All other
>>>>> procs simply check their own selection against the one given by rank=0
>>>>>
>>>>> Now, if a knowledgeable user or sys admin specifies what to use for their
>>>>> system, we won't penalize their startup time. A user who doesn't know what
>>>>> to do gets to run, albeit less scalably on startup.
>>>>>
>>>>> Looking forward from there, we can look to a day where failing to
>>>>> initialize something that exists on the system could be detected in some
>>>>> other fashion, letting the local proc abort since it would know that other
>>>>> procs that detected similar capabilities may well have selected that PML.
>>>>> For now, though, this would solve the problem.
>>>>>
>>>>> Make sense?
>>>>> Ralph
>>>>>
>>>>>
>>>>>
>>>>> On 6/23/08 1:31 PM, "Brian W. Barrett" wrote:
>>>>>
>>>>>> The problem is that we default to OB1, but that's not the right choice for
>>>>>> some platforms (like Pathscale / PSM), where there's a huge performance
>>>>>> hit for using OB1.  So we run into a situation where user installs Open
>>>>>> MPI, starts running, gets horrible performance, bad mouths Open MPI, and
>>>>>> now we're in that game again.  Yeah, the sys admin should know what to do,
>>>>>> but it doesn't always work that way.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>>
>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>
>>>>>>> My fault - I should be more precise in my language. ;-/
>>>>>>>
>>>>>>> #1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems
>>>>>>> to me that a simpler solution to what you describe is for the user to
>>>>>>> specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal
>>>>>>> with the failed-to-initialize problem cleanly by having the proc directly
>>>>>>> abort.
>>>>>>>
>>>>>>> Again, sometimes I think we attempt to automate too many things. This seems
>>>>>>> like a pretty clear case where you know what you want - the sys admin, if
>>>>>>> nobody else, can certainly set that mca param in the default param file!
>>>>>>>
>>>>>>> Otherwise, it seems to me that you are relying on the modex to detect that
>>>>>>> your proc failed to init the correct subsystem. I hate to force a modex just
>>>>>>> for that - if so, then perhaps this could again be a settable option to
>>>>>>> avoid requiring non-scalable behavior for those of us who want scalability?
>>>>>>>
>>>>>>>
>>>>>>> On 6/23/08 1:21 PM, "Brian W. Barrett" wrote:
>>>>>>>
>>>>>>>> The selection code was added because frequently high speed interconnects
>>>>>>>> fail to initialize properly due to random stuff happening (yes, that's a
>>>>>>>> horrible statement, but true).  We ran into a situation with some really
>>>>>>>> flaky machines where most 

Re: [OMPI devel] PML selection logic

2008-06-24 Thread George Bosilca
Brian hinted at a possible bug in one of his replies. How does this work
in the case of dynamic processes? We can envision several scenarios,
but let's take a simple one: 2 jobs that get connected with connect/accept.
One might publish the PML name (simply because the -mca argument was
on) and one might not?


  george.

On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:


Also sounds good to me.

Note that the most difficult part of the forward-looking plan is  
that we usually can't tell the difference between "something failed  
to initialize" and "you don't have support for feature X".


I like the general philosophy of: running out of the box always  
works just fine, but if you/the sysadmin is smart, you can get  
performance improvements.



On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:


I concur
- galen

On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:


That sounds like a reasonable plan to me.

Brian

On Mon, 23 Jun 2008, Ralph H Castain wrote:

Okay, so let's explore an alternative that preserves the support you are
seeking for the "ignorant user", but doesn't penalize everyone else.
What we could do is simply set things up so that:

1. if -mca pml xyz is provided, then no modex data is added

2. if it is not provided, then only rank=0 inserts the data. All other
procs simply check their own selection against the one given by rank=0

Now, if a knowledgeable user or sys admin specifies what to use for their
system, we won't penalize their startup time. A user who doesn't know what
to do gets to run, albeit less scalably on startup.

Looking forward from there, we can look to a day where failing to
initialize something that exists on the system could be detected in some
other fashion, letting the local proc abort since it would know that other
procs that detected similar capabilities may well have selected that PML.
For now, though, this would solve the problem.

Make sense?
Ralph



On 6/23/08 1:31 PM, "Brian W. Barrett"   
wrote:


The problem is that we default to OB1, but that's not the right choice for
some platforms (like Pathscale / PSM), where there's a huge performance
hit for using OB1.  So we run into a situation where user installs Open
MPI, starts running, gets horrible performance, bad mouths Open MPI, and
now we're in that game again.  Yeah, the sys admin should know what to do,
but it doesn't always work that way.

Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:


My fault - I should be more precise in my language. ;-/

#1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems
to me that a simpler solution to what you describe is for the user to
specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal
with the failed-to-initialize problem cleanly by having the proc directly
abort.

Again, sometimes I think we attempt to automate too many things. This seems
like a pretty clear case where you know what you want - the sys admin, if
nobody else, can certainly set that mca param in the default param file!

Otherwise, it seems to me that you are relying on the modex to detect that
your proc failed to init the correct subsystem. I hate to force a modex just
for that - if so, then perhaps this could again be a settable option to
avoid requiring non-scalable behavior for those of us who want scalability?



On 6/23/08 1:21 PM, "Brian W. Barrett"   
wrote:


The selection code was added because frequently high speed interconnects
fail to initialize properly due to random stuff happening (yes, that's a
horrible statement, but true).  We ran into a situation with some really
flaky machines where most of the processes would choose CM, but a couple
would fail to initialize the MTL and therefore choose OB1.  This led to a
hang situation, which is the worst of the worst.

I think #1 is adequate, although it doesn't handle spawn particularly
well.  And spawn is generally used in environments where such network
mismatches are most likely to occur.

Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:


Since my goal is to eliminate the modex completely for managed
installations, could you give me a brief understanding of this eventual PML
selection logic? It would help to hear an example of how and why different
procs could get different answers - and why we would want to allow them to
do so.

Thanks
Ralph



On 6/23/08 11:59 AM, "Aurélien Bouteiller" wrote:


The first approach sounds fair enough to me. We should avoid 2 and 3
as the pml selection mechanism used to be more complex before we reduced
it to accommodate a major design bug in the BTL selection process. When
using the complete PML selection, BTL would be initialized several times,
leading to a variety of bugs. Eventually the PML selection should return
to its old self, when the BTL bug gets fixed.

Aurelien

Le 23 juin 08 à 12:36, Ralph H Castain a écrit :


Yo all

I've been doing further research i

Re: [OMPI devel] PML selection logic

2008-06-24 Thread Jeff Squyres

Also sounds good to me.

Note that the most difficult part of the forward-looking plan is that  
we usually can't tell the difference between "something failed to  
initialize" and "you don't have support for feature X".


I like the general philosophy of: running out of the box always works  
just fine, but if you/the sysadmin is smart, you can get performance  
improvements.



On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:


I concur
- galen

On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:


That sounds like a reasonable plan to me.

Brian

On Mon, 23 Jun 2008, Ralph H Castain wrote:

Okay, so let's explore an alternative that preserves the support you are
seeking for the "ignorant user", but doesn't penalize everyone else.
What we could do is simply set things up so that:

1. if -mca pml xyz is provided, then no modex data is added

2. if it is not provided, then only rank=0 inserts the data. All other
procs simply check their own selection against the one given by rank=0

Now, if a knowledgeable user or sys admin specifies what to use for their
system, we won't penalize their startup time. A user who doesn't know what
to do gets to run, albeit less scalably on startup.

Looking forward from there, we can look to a day where failing to
initialize something that exists on the system could be detected in some
other fashion, letting the local proc abort since it would know that other
procs that detected similar capabilities may well have selected that PML.
For now, though, this would solve the problem.

Make sense?
Ralph



On 6/23/08 1:31 PM, "Brian W. Barrett"   
wrote:


The problem is that we default to OB1, but that's not the right choice for
some platforms (like Pathscale / PSM), where there's a huge performance
hit for using OB1.  So we run into a situation where user installs Open
MPI, starts running, gets horrible performance, bad mouths Open MPI, and
now we're in that game again.  Yeah, the sys admin should know what to do,
but it doesn't always work that way.

Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:


My fault - I should be more precise in my language. ;-/

#1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems
to me that a simpler solution to what you describe is for the user to
specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal
with the failed-to-initialize problem cleanly by having the proc directly
abort.

Again, sometimes I think we attempt to automate too many things. This seems
like a pretty clear case where you know what you want - the sys admin, if
nobody else, can certainly set that mca param in the default param file!

Otherwise, it seems to me that you are relying on the modex to detect that
your proc failed to init the correct subsystem. I hate to force a modex just
for that - if so, then perhaps this could again be a settable option to
avoid requiring non-scalable behavior for those of us who want scalability?



On 6/23/08 1:21 PM, "Brian W. Barrett"   
wrote:


The selection code was added because frequently high speed  
interconnects
fail to initialize properly due to random stuff happening (yes,  
that's a
horrible statement, but true).  We ran into a situation with  
some really
flaky machines where most of the processes would chose CM, but  
a couple
would fail to initialize the MTL and therefore chose OB1.  This  
lead to a

hang situation, which is the worst of the worst.

I think #1 is adequate, although it doesn't handle spawn  
particularly
well.  And spawn is generally used in environments where such  
network

mismatches are most likely to occur.

Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:


Since my goal is to eliminate the modex completely for managed
installations, could you give me a brief understanding of this  
eventual PML
selection logic? It would help to hear an example of how and  
why different
procs could get different answers - and why we would want to  
allow them to

do so.

Thanks
Ralph



On 6/23/08 11:59 AM, "Aurélien Bouteiller" > wrote:


The first approach sounds fair enough to me. We should avoid  
2 and 3

as the pml selection mechanism used to be
more complex before we reduced it to accommodate a major  
design bug in
the BTL selection process. When using the complete PML  
selection, BTL
would be initialized several times, leading to a variety of  
bugs.
Eventually the PML selection should return to its old self,  
when the

BTL bug gets fixed.

Aurelien

Le 23 juin 08 à 12:36, Ralph H Castain a écrit :


Yo all

I've been doing further research into the modex and came  
across

something I
don't fully understand. It seems we have each process insert  
into

the modex
the name of the PML module that it selected. Once the modex  
has

exchanged
that info, it then loops across all procs in the job to  
check their
selection, and aborts if any proc picked a different PML  
module.


All well and good...assuming that

Re: [OMPI devel] PML selection logic

2008-06-23 Thread Shipman, Galen M.

I concur
- galen

On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:


That sounds like a reasonable plan to me.

Brian


Re: [OMPI devel] PML selection logic

2008-06-23 Thread Brian W. Barrett

That sounds like a reasonable plan to me.

Brian

On Mon, 23 Jun 2008, Ralph H Castain wrote:


Okay, so let's explore an alternative that preserves the support you are
seeking for the "ignorant user", but doesn't penalize everyone else. What we
could do is simply set things up so that:

1. if -mca pml xyz is provided, then no modex data is added

2. if it is not provided, then only rank=0 inserts the data. All other procs
simply check their own selection against the one given by rank=0

Now, if a knowledgeable user or sys admin specifies what to use for their
system, we won't penalize their startup time. A user who doesn't know what
to do gets to run, albeit less scalably on startup.

Looking forward from there, we can look to a day where failing to initialize
something that exists on the system could be detected in some other fashion,
letting the local proc abort since it would know that other procs that
detected similar capabilities may well have selected that PML. For now,
though, this would solve the problem.

Make sense?
Ralph



On 6/23/08 1:31 PM, "Brian W. Barrett"  wrote:


The problem is that we default to OB1, but that's not the right choice for
some platforms (like Pathscale / PSM), where there's a huge performance
hit for using OB1.  So we run into a situation where a user installs Open
MPI, starts running, gets horrible performance, bad-mouths Open MPI, and
now we're in that game again.  Yeah, the sys admin should know what to do,
but it doesn't always work that way.

Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:


My fault - I should be more precise in my language. ;-/

#1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems
to me that a simpler solution to what you describe is for the user to
specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal
with the failed-to-initialize problem cleanly by having the proc directly
abort.

Again, sometimes I think we attempt to automate too many things. This seems
like a pretty clear case where you know what you want - the sys admin, if
nobody else, can certainly set that mca param in the default param file!

Otherwise, it seems to me that you are relying on the modex to detect that
your proc failed to init the correct subsystem. I hate to force a modex just
for that - if so, then perhaps this could again be a settable option to
avoid requiring non-scalable behavior for those of us who want scalability?


On 6/23/08 1:21 PM, "Brian W. Barrett"  wrote:


The selection code was added because frequently high speed interconnects
fail to initialize properly due to random stuff happening (yes, that's a
horrible statement, but true).  We ran into a situation with some really
flaky machines where most of the processes would choose CM, but a couple
would fail to initialize the MTL and therefore choose OB1.  This led to a
hang situation, which is the worst of the worst.

I think #1 is adequate, although it doesn't handle spawn particularly
well.  And spawn is generally used in environments where such network
mismatches are most likely to occur.

Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:


Since my goal is to eliminate the modex completely for managed
installations, could you give me a brief understanding of this eventual PML
selection logic? It would help to hear an example of how and why different
procs could get different answers - and why we would want to allow them to
do so.

Thanks
Ralph



On 6/23/08 11:59 AM, "Aurélien Bouteiller"  wrote:


The first approach sounds fair enough to me. We should avoid 2 and 3
as the pml selection mechanism used to be
more complex before we reduced it to accommodate a major design bug in
the BTL selection process. When using the complete PML selection, BTL
would be initialized several times, leading to a variety of bugs.
Eventually the PML selection should return to its old self, when the
BTL bug gets fixed.

Aurelien

On 23 June 2008 at 12:36, Ralph H Castain wrote:


Yo all

I've been doing further research into the modex and came across
something I
don't fully understand. It seems we have each process insert into
the modex
the name of the PML module that it selected. Once the modex has
exchanged
that info, it then loops across all procs in the job to check their
selection, and aborts if any proc picked a different PML module.

All well and good...assuming that procs actually -can- choose
different PML
modules and hence create an "abort" scenario. However, if I look
inside the
PML's at their selection logic, I find that a proc can ONLY pick a
module
other than ob1 if:

1. the user specifies the module to use via -mca pml xyz or by using a
module specific mca param to adjust its priority. In this case,
since the
mca param is propagated, ALL procs have no choice but to pick that
same
module, so that can't cause us to abort (we will have already
returned an
error and aborted if the specified module can't run).

2. the pml/cm module detects that an MTL module was selected, an

Re: [OMPI devel] PML selection logic

2008-06-23 Thread Ralph H Castain
Okay, so let's explore an alternative that preserves the support you are
seeking for the "ignorant user", but doesn't penalize everyone else. What we
could do is simply set things up so that:

1. if -mca pml xyz is provided, then no modex data is added

2. if it is not provided, then only rank=0 inserts the data. All other procs
simply check their own selection against the one given by rank=0

Now, if a knowledgeable user or sys admin specifies what to use for their
system, we won't penalize their startup time. A user who doesn't know what
to do gets to run, albeit less scalably on startup.
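
The two rules above can be sketched roughly as follows. This is an illustration in Python, not Open MPI code: the function names `modex_entry` and `check_selection` are hypothetical, and a plain value stands in for whatever rank 0 would place in the modex.

```python
from typing import Optional


def modex_entry(rank: int, selected_pml: str, user_pml: Optional[str]) -> Optional[str]:
    """Return the PML name this proc would publish, or None to publish nothing."""
    if user_pml is not None:
        # Rule 1: the PML was given explicitly (e.g. -mca pml xyz), so no modex data.
        return None
    # Rule 2: only rank 0 publishes its selection.
    return selected_pml if rank == 0 else None


def check_selection(selected_pml: str, rank0_pml: Optional[str]) -> bool:
    """True if this proc agrees with rank 0, or if rank 0 published nothing."""
    return rank0_pml is None or selected_pml == rank0_pml


# Hypothetical 4-proc job with no explicit PML, where rank 2 fell back to ob1.
selections = ["cm", "cm", "ob1", "cm"]
published = modex_entry(0, selections[0], None)
mismatched = [r for r, s in enumerate(selections) if not check_selection(s, published)]
print(mismatched)
```

In this sketch only one entry is ever published, and each proc performs a single comparison instead of looping over every peer.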

Looking forward from there, we can look to a day where failing to initialize
something that exists on the system could be detected in some other fashion,
letting the local proc abort since it would know that other procs that
detected similar capabilities may well have selected that PML. For now,
though, this would solve the problem.

Make sense?
Ralph



On 6/23/08 1:31 PM, "Brian W. Barrett"  wrote:

> The problem is that we default to OB1, but that's not the right choice for
> some platforms (like Pathscale / PSM), where there's a huge performance
> hit for using OB1.  So we run into a situation where a user installs Open
> MPI, starts running, gets horrible performance, bad-mouths Open MPI, and
> now we're in that game again.  Yeah, the sys admin should know what to do,
> but it doesn't always work that way.
> 
> Brian
> 
> 
> On Mon, 23 Jun 2008, Ralph H Castain wrote:
> 
>> My fault - I should be more precise in my language. ;-/
>> 
>> #1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems
>> to me that a simpler solution to what you describe is for the user to
>> specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal
>> with the failed-to-initialize problem cleanly by having the proc directly
>> abort.
>> 
>> Again, sometimes I think we attempt to automate too many things. This seems
>> like a pretty clear case where you know what you want - the sys admin, if
>> nobody else, can certainly set that mca param in the default param file!
>> 
>> Otherwise, it seems to me that you are relying on the modex to detect that
>> your proc failed to init the correct subsystem. I hate to force a modex just
>> for that - if so, then perhaps this could again be a settable option to
>> avoid requiring non-scalable behavior for those of us who want scalability?
>> 
>> 
>> On 6/23/08 1:21 PM, "Brian W. Barrett"  wrote:
>> 
>>> The selection code was added because frequently high speed interconnects
>>> fail to initialize properly due to random stuff happening (yes, that's a
>>> horrible statement, but true).  We ran into a situation with some really
>>> flaky machines where most of the processes would choose CM, but a couple
>>> would fail to initialize the MTL and therefore choose OB1.  This led to a
>>> hang situation, which is the worst of the worst.
>>> 
>>> I think #1 is adequate, although it doesn't handle spawn particularly
>>> well.  And spawn is generally used in environments where such network
>>> mismatches are most likely to occur.
>>> 
>>> Brian
>>> 
>>> 
>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>> 
>>>> Since my goal is to eliminate the modex completely for managed
>>>> installations, could you give me a brief understanding of this eventual PML
>>>> selection logic? It would help to hear an example of how and why different
>>>> procs could get different answers - and why we would want to allow them to
>>>> do so.
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>>
>>>>
>>>> On 6/23/08 11:59 AM, "Aurélien Bouteiller" wrote:
>>>>
>>>>> The first approach sounds fair enough to me. We should avoid 2 and 3
>>>>> as the pml selection mechanism used to be
>>>>> more complex before we reduced it to accommodate a major design bug in
>>>>> the BTL selection process. When using the complete PML selection, BTL
>>>>> would be initialized several times, leading to a variety of bugs.
>>>>> Eventually the PML selection should return to its old self, when the
>>>>> BTL bug gets fixed.
>>>>>
>>>>> Aurelien
>>>>>
>>>>> On 23 June 2008 at 12:36, Ralph H Castain wrote:
>>>>>
>>>>>> Yo all
>>>>>>
>>>>>> I've been doing further research into the modex and came across
>>>>>> something I
>>>>>> don't fully understand. It seems we have each process insert into
>>>>>> the modex
>>>>>> the name of the PML module that it selected. Once the modex has
>>>>>> exchanged
>>>>>> that info, it then loops across all procs in the job to check their
>>>>>> selection, and aborts if any proc picked a different PML module.
>>>>>>
>>>>>> All well and good...assuming that procs actually -can- choose
>>>>>> different PML
>>>>>> modules and hence create an "abort" scenario. However, if I look
>>>>>> inside the
>>>>>> PML's at their selection logic, I find that a proc can ONLY pick a
>>>>>> module
>>>>>> other than ob1 if:
>>>>>>
>>>>>> 1. the user specifies the module to use via -mca pml xyz or by using a
>>>>>> module spe

Re: [OMPI devel] PML selection logic

2008-06-23 Thread Brian W. Barrett
The problem is that we default to OB1, but that's not the right choice for 
some platforms (like Pathscale / PSM), where there's a huge performance 
hit for using OB1.  So we run into a situation where a user installs Open
MPI, starts running, gets horrible performance, bad-mouths Open MPI, and
now we're in that game again.  Yeah, the sys admin should know what to do, 
but it doesn't always work that way.


Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:


My fault - I should be more precise in my language. ;-/

#1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems
to me that a simpler solution to what you describe is for the user to
specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal
with the failed-to-initialize problem cleanly by having the proc directly
abort.

Again, sometimes I think we attempt to automate too many things. This seems
like a pretty clear case where you know what you want - the sys admin, if
nobody else, can certainly set that mca param in the default param file!

Otherwise, it seems to me that you are relying on the modex to detect that
your proc failed to init the correct subsystem. I hate to force a modex just
for that - if so, then perhaps this could again be a settable option to
avoid requiring non-scalable behavior for those of us who want scalability?


On 6/23/08 1:21 PM, "Brian W. Barrett"  wrote:


The selection code was added because frequently high speed interconnects
fail to initialize properly due to random stuff happening (yes, that's a
horrible statement, but true).  We ran into a situation with some really
flaky machines where most of the processes would choose CM, but a couple
would fail to initialize the MTL and therefore choose OB1.  This led to a
hang situation, which is the worst of the worst.

I think #1 is adequate, although it doesn't handle spawn particularly
well.  And spawn is generally used in environments where such network
mismatches are most likely to occur.

Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:


Since my goal is to eliminate the modex completely for managed
installations, could you give me a brief understanding of this eventual PML
selection logic? It would help to hear an example of how and why different
procs could get different answers - and why we would want to allow them to
do so.

Thanks
Ralph



On 6/23/08 11:59 AM, "Aurélien Bouteiller"  wrote:


The first approach sounds fair enough to me. We should avoid 2 and 3
as the pml selection mechanism used to be
more complex before we reduced it to accommodate a major design bug in
the BTL selection process. When using the complete PML selection, BTL
would be initialized several times, leading to a variety of bugs.
Eventually the PML selection should return to its old self, when the
BTL bug gets fixed.

Aurelien

On 23 June 2008 at 12:36, Ralph H Castain wrote:


Yo all

I've been doing further research into the modex and came across
something I
don't fully understand. It seems we have each process insert into
the modex
the name of the PML module that it selected. Once the modex has
exchanged
that info, it then loops across all procs in the job to check their
selection, and aborts if any proc picked a different PML module.

All well and good...assuming that procs actually -can- choose
different PML
modules and hence create an "abort" scenario. However, if I look
inside the
PML's at their selection logic, I find that a proc can ONLY pick a
module
other than ob1 if:

1. the user specifies the module to use via -mca pml xyz or by using a
module specific mca param to adjust its priority. In this case,
since the
mca param is propagated, ALL procs have no choice but to pick that
same
module, so that can't cause us to abort (we will have already
returned an
error and aborted if the specified module can't run).

2. the pml/cm module detects that an MTL module was selected, and
that it is
other than "psm". In this case, the CM module will be selected
because its
default priority is higher than that of OB1.

In looking deeper into the MTL selection logic, it appears to me
that you
either have the required capability or you don't. I can see that in
some
environments (e.g., rsh across unmanaged collections of machines),
it might
be possible for someone to launch across a set of machines where
some do and
some don't have the required support. However, in all other cases,
this will
be homogeneous across the system.

Given this analysis (and someone more familiar with the PML should
feel free
to confirm or correct it), it seems to me that this could be
streamlined via
one or more means:

1. at the most, we could have rank=0 add the PML module name to the
modex,
and other procs simply check it against their own and return an
error if
they differ. This accomplishes the identical functionality to what
we have
today, but with much less info in the modex.

2. we could eliminate this info from the modex altogether by
requiring the
user to specify the PML module i

Re: [OMPI devel] PML selection logic

2008-06-23 Thread Ralph H Castain
My fault - I should be more precise in my language. ;-/

#1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems
to me that a simpler solution to what you describe is for the user to
specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal
with the failed-to-initialize problem cleanly by having the proc directly
abort.
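
As a rough sketch of that local abort (Python for illustration only; `select_pml`, `PMLInitError`, and the `available` set are made-up stand-ins, not Open MPI calls, and the no-request branch simply returns the default rather than modelling priority-based selection):

```python
from typing import Optional, Set


class PMLInitError(RuntimeError):
    """Raised when an explicitly requested PML cannot be initialized."""


def select_pml(requested: Optional[str], available: Set[str], default: str = "ob1") -> str:
    """Pick a PML; fail hard instead of falling back when one was requested."""
    if requested is None:
        # Simplification: no request means the default is used.
        return default
    if requested not in available:
        # The proc aborts by itself; no modex comparison is needed to notice.
        raise PMLInitError(f"requested PML '{requested}' failed to initialize")
    return requested
```

With `-mca pml cm` in effect, a proc whose MTL did not come up would hit the error branch locally, so a mixed CM/OB1 job could not arise in this model.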

Again, sometimes I think we attempt to automate too many things. This seems
like a pretty clear case where you know what you want - the sys admin, if
nobody else, can certainly set that mca param in the default param file!

Otherwise, it seems to me that you are relying on the modex to detect that
your proc failed to init the correct subsystem. I hate to force a modex just
for that - if so, then perhaps this could again be a settable option to
avoid requiring non-scalable behavior for those of us who want scalability?


On 6/23/08 1:21 PM, "Brian W. Barrett"  wrote:

> The selection code was added because frequently high speed interconnects
> fail to initialize properly due to random stuff happening (yes, that's a
> horrible statement, but true).  We ran into a situation with some really
> flaky machines where most of the processes would choose CM, but a couple
> would fail to initialize the MTL and therefore choose OB1.  This led to a
> hang situation, which is the worst of the worst.
> 
> I think #1 is adequate, although it doesn't handle spawn particularly
> well.  And spawn is generally used in environments where such network
> mismatches are most likely to occur.
> 
> Brian
> 
> 
> On Mon, 23 Jun 2008, Ralph H Castain wrote:
> 
>> Since my goal is to eliminate the modex completely for managed
>> installations, could you give me a brief understanding of this eventual PML
>> selection logic? It would help to hear an example of how and why different
>> procs could get different answers - and why we would want to allow them to
>> do so.
>> 
>> Thanks
>> Ralph
>> 
>> 
>> 
>> On 6/23/08 11:59 AM, "Aurélien Bouteiller"  wrote:
>> 
>>> The first approach sounds fair enough to me. We should avoid 2 and 3
>>> as the pml selection mechanism used to be
>>> more complex before we reduced it to accommodate a major design bug in
>>> the BTL selection process. When using the complete PML selection, BTL
>>> would be initialized several times, leading to a variety of bugs.
>>> Eventually the PML selection should return to its old self, when the
>>> BTL bug gets fixed.
>>> 
>>> Aurelien
>>> 
>>> On 23 June 2008 at 12:36, Ralph H Castain wrote:
>>>
>>>> Yo all
>>>>
>>>> I've been doing further research into the modex and came across
>>>> something I
>>>> don't fully understand. It seems we have each process insert into
>>>> the modex
>>>> the name of the PML module that it selected. Once the modex has
>>>> exchanged
>>>> that info, it then loops across all procs in the job to check their
>>>> selection, and aborts if any proc picked a different PML module.
>>>>
>>>> All well and good...assuming that procs actually -can- choose
>>>> different PML
>>>> modules and hence create an "abort" scenario. However, if I look
>>>> inside the
>>>> PML's at their selection logic, I find that a proc can ONLY pick a
>>>> module
>>>> other than ob1 if:
>>>>
>>>> 1. the user specifies the module to use via -mca pml xyz or by using a
>>>> module specific mca param to adjust its priority. In this case,
>>>> since the
>>>> mca param is propagated, ALL procs have no choice but to pick that
>>>> same
>>>> module, so that can't cause us to abort (we will have already
>>>> returned an
>>>> error and aborted if the specified module can't run).
>>>>
>>>> 2. the pml/cm module detects that an MTL module was selected, and
>>>> that it is
>>>> other than "psm". In this case, the CM module will be selected
>>>> because its
>>>> default priority is higher than that of OB1.
>>>>
>>>> In looking deeper into the MTL selection logic, it appears to me
>>>> that you
>>>> either have the required capability or you don't. I can see that in
>>>> some
>>>> environments (e.g., rsh across unmanaged collections of machines),
>>>> it might
>>>> be possible for someone to launch across a set of machines where
>>>> some do and
>>>> some don't have the required support. However, in all other cases,
>>>> this will
>>>> be homogeneous across the system.
>>>>
>>>> Given this analysis (and someone more familiar with the PML should
>>>> feel free
>>>> to confirm or correct it), it seems to me that this could be
>>>> streamlined via
>>>> one or more means:
>>>>
>>>> 1. at the most, we could have rank=0 add the PML module name to the
>>>> modex,
>>>> and other procs simply check it against their own and return an
>>>> error if
>>>> they differ. This accomplishes the identical functionality to what
>>>> we have
>>>> today, but with much less info in the modex.
>>>>
>>>> 2. we could eliminate this info from the modex altogether by
>>>> requiring the
>>>> user to specify the PML module if they want something other than th

Re: [OMPI devel] PML selection logic

2008-06-23 Thread Brian W. Barrett
The selection code was added because frequently high speed interconnects 
fail to initialize properly due to random stuff happening (yes, that's a 
horrible statement, but true).  We ran into a situation with some really 
flaky machines where most of the processes would choose CM, but a couple
would fail to initialize the MTL and therefore choose OB1.  This led to a
hang situation, which is the worst of the worst.
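
To make the failure mode concrete, here is a small sketch (Python, illustrative only; `mtl_ok` is a hypothetical per-process flag for whether the MTL came up, and the fallback rule is a simplification of the real selection):

```python
from typing import List


def choose_pml(mtl_ok: bool) -> str:
    """CM needs a working MTL; without one the proc silently falls back to OB1."""
    return "cm" if mtl_ok else "ob1"


def is_consistent(choices: List[str]) -> bool:
    """A job is only safe to run if every proc picked the same PML."""
    return len(set(choices)) <= 1


# Two of six procs land on flaky nodes where the MTL fails to initialize.
mtl_status = [True, True, False, True, False, True]
choices = [choose_pml(ok) for ok in mtl_status]
print(choices, is_consistent(choices))
```

Without some cross-process check, nothing in this model tells the four CM procs that two of their peers are speaking OB1.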


I think #1 is adequate, although it doesn't handle spawn particularly 
well.  And spawn is generally used in environments where such network 
mismatches are most likely to occur.


Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:


Since my goal is to eliminate the modex completely for managed
installations, could you give me a brief understanding of this eventual PML
selection logic? It would help to hear an example of how and why different
procs could get different answers - and why we would want to allow them to
do so.

Thanks
Ralph



On 6/23/08 11:59 AM, "Aurélien Bouteiller"  wrote:


The first approach sounds fair enough to me. We should avoid 2 and 3
as the pml selection mechanism used to be
more complex before we reduced it to accommodate a major design bug in
the BTL selection process. When using the complete PML selection, BTL
would be initialized several times, leading to a variety of bugs.
Eventually the PML selection should return to its old self, when the
BTL bug gets fixed.

Aurelien

On 23 June 2008 at 12:36, Ralph H Castain wrote:


Yo all

I've been doing further research into the modex and came across
something I
don't fully understand. It seems we have each process insert into
the modex
the name of the PML module that it selected. Once the modex has
exchanged
that info, it then loops across all procs in the job to check their
selection, and aborts if any proc picked a different PML module.
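
The check described above might look roughly like this. It is a sketch in Python under stated assumptions: `run_modex`, `verify_all`, and `PMLMismatch` are hypothetical names, and a dictionary keyed by rank stands in for the exchanged modex data.

```python
from typing import Dict


class PMLMismatch(RuntimeError):
    """Raised when a peer selected a different PML module."""


def run_modex(selected: Dict[int, str]) -> Dict[int, str]:
    """Stand-in for the modex: afterwards every proc can see every entry."""
    return dict(selected)


def verify_all(my_rank: int, modex: Dict[int, str]) -> None:
    """Loop over all procs and abort on the first PML that differs from ours."""
    mine = modex[my_rank]
    for rank, pml in sorted(modex.items()):
        if pml != mine:
            raise PMLMismatch(f"rank {my_rank}: rank {rank} selected {pml}, expected {mine}")
```

In this model every proc contributes one entry and every proc walks all of them, which is the per-process cost the rest of this message tries to avoid.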

All well and good...assuming that procs actually -can- choose
different PML
modules and hence create an "abort" scenario. However, if I look
inside the
PML's at their selection logic, I find that a proc can ONLY pick a
module
other than ob1 if:

1. the user specifies the module to use via -mca pml xyz or by using a
module specific mca param to adjust its priority. In this case,
since the
mca param is propagated, ALL procs have no choice but to pick that
same
module, so that can't cause us to abort (we will have already
returned an
error and aborted if the specified module can't run).

2. the pml/cm module detects that an MTL module was selected, and
that it is
other than "psm". In this case, the CM module will be selected
because its
default priority is higher than that of OB1.

In looking deeper into the MTL selection logic, it appears to me
that you
either have the required capability or you don't. I can see that in
some
environments (e.g., rsh across unmanaged collections of machines),
it might
be possible for someone to launch across a set of machines where
some do and
some don't have the required support. However, in all other cases,
this will
be homogeneous across the system.

Given this analysis (and someone more familiar with the PML should
feel free
to confirm or correct it), it seems to me that this could be
streamlined via
one or more means:

1. at the most, we could have rank=0 add the PML module name to the
modex,
and other procs simply check it against their own and return an
error if
they differ. This accomplishes the identical functionality to what
we have
today, but with much less info in the modex.

2. we could eliminate this info from the modex altogether by
requiring the
user to specify the PML module if they want something other than the
default
OB1. In this case, there can be no confusion over what each proc is
to use.
The CM module will attempt to init the MTL - if it cannot do so,
then the
job will return the correct error and tell the user that CM/MTL
support is
unavailable.

3. we could again eliminate the info by not inserting it into the
modex if
(a) the default PML module is selected, or (b) the user specified
the PML
module to be used. In the first case, each proc can simply check to
see if
they picked the default - if not, then we can insert the info to
indicate
the difference. Thus, in the "standard" case, no info will be
inserted.

In the second case, we will already get an error if the specified
PML module
could not be used. Hence, the modex check provides no additional
info or
value.
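
Option 3 could be sketched like this (Python for illustration; `entry_to_publish` and `DEFAULT_PML` are hypothetical names, with OB1 taken as the default as stated above):

```python
from typing import Optional

DEFAULT_PML = "ob1"  # default named in the message above


def entry_to_publish(selected: str, user_pml: Optional[str]) -> Optional[str]:
    """Return the PML name to insert into the modex, or None for no entry."""
    if user_pml is not None:
        # Case (b): failure to use a user-specified PML is already an error.
        return None
    if selected == DEFAULT_PML:
        # Case (a): the standard case adds nothing to the modex.
        return None
    # Only a proc that deviated from the default announces the difference.
    return selected
```

Under these assumptions a job where every proc picks the default, or where the user named the PML, would exchange no PML entries at all.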

I understand the motivation to support automation. However, in this
case,
the automation actually doesn't seem to buy us very much, and it isn't
coming "free". So perhaps some change in how this is done would be
in order?

Ralph



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] PML selection logic

2008-06-23 Thread Ralph H Castain
Since my goal is to eliminate the modex completely for managed
installations, could you give me a brief understanding of this eventual PML
selection logic? It would help to hear an example of how and why different
procs could get different answers - and why we would want to allow them to
do so.

Thanks
Ralph



On 6/23/08 11:59 AM, "Aurélien Bouteiller"  wrote:

> The first approach sounds fair enough to me. We should avoid 2 and 3 as the
> pml selection mechanism used to be more complex before we reduced it to
> accommodate a major design bug in the BTL selection process. When using the
> complete PML selection, BTL would be initialized several times, leading to
> a variety of bugs. Eventually the PML selection should return to its old
> self, when the BTL bug gets fixed.
> 
> Aurelien
> 
> On 23 June 2008, at 12:36, Ralph H Castain wrote:
> 
>> Yo all
>>
>> I've been doing further research into the modex and came across something I
>> don't fully understand. It seems we have each process insert into the modex
>> the name of the PML module that it selected. Once the modex has exchanged
>> that info, it then loops across all procs in the job to check their
>> selection, and aborts if any proc picked a different PML module.
>>
>> All well and good...assuming that procs actually -can- choose different PML
>> modules and hence create an "abort" scenario. However, if I look inside the
>> PML's at their selection logic, I find that a proc can ONLY pick a module
>> other than ob1 if:
>>
>> 1. the user specifies the module to use via -mca pml xyz or by using a
>> module specific mca param to adjust its priority. In this case, since the
>> mca param is propagated, ALL procs have no choice but to pick that same
>> module, so that can't cause us to abort (we will have already returned an
>> error and aborted if the specified module can't run).
>>
>> 2. the pml/cm module detects that an MTL module was selected, and that it is
>> other than "psm". In this case, the CM module will be selected because its
>> default priority is higher than that of OB1.
>>
>> In looking deeper into the MTL selection logic, it appears to me that you
>> either have the required capability or you don't. I can see that in some
>> environments (e.g., rsh across unmanaged collections of machines), it might
>> be possible for someone to launch across a set of machines where some do and
>> some don't have the required support. However, in all other cases, this will
>> be homogeneous across the system.
>>
>> Given this analysis (and someone more familiar with the PML should feel free
>> to confirm or correct it), it seems to me that this could be streamlined via
>> one or more means:
>>
>> 1. at the most, we could have rank=0 add the PML module name to the modex,
>> and other procs simply check it against their own and return an error if
>> they differ. This accomplishes the identical functionality to what we have
>> today, but with much less info in the modex.
>>
>> 2. we could eliminate this info from the modex altogether by requiring the
>> user to specify the PML module if they want something other than the default
>> OB1. In this case, there can be no confusion over what each proc is to use.
>> The CM module will attempt to init the MTL - if it cannot do so, then the
>> job will return the correct error and tell the user that CM/MTL support is
>> unavailable.
>>
>> 3. we could again eliminate the info by not inserting it into the modex if
>> (a) the default PML module is selected, or (b) the user specified the PML
>> module to be used. In the first case, each proc can simply check to see if
>> they picked the default - if not, then we can insert the info to indicate
>> the difference. Thus, in the "standard" case, no info will be inserted.
>>
>> In the second case, we will already get an error if the specified PML module
>> could not be used. Hence, the modex check provides no additional info or
>> value.
>>
>> I understand the motivation to support automation. However, in this case,
>> the automation actually doesn't seem to buy us very much, and it isn't
>> coming "free". So perhaps some change in how this is done would be in order?
>>
>> Ralph





Re: [OMPI devel] PML selection logic

2008-06-23 Thread Aurélien Bouteiller
The first approach sounds fair enough to me. We should avoid 2 and 3 as the
pml selection mechanism used to be more complex before we reduced it to
accommodate a major design bug in the BTL selection process. When using the
complete PML selection, BTL would be initialized several times, leading to
a variety of bugs. Eventually the PML selection should return to its old
self, when the BTL bug gets fixed.


Aurelien

On 23 June 2008, at 12:36, Ralph H Castain wrote:


Yo all

I've been doing further research into the modex and came across something I
don't fully understand. It seems we have each process insert into the modex
the name of the PML module that it selected. Once the modex has exchanged
that info, it then loops across all procs in the job to check their
selection, and aborts if any proc picked a different PML module.

All well and good...assuming that procs actually -can- choose different PML
modules and hence create an "abort" scenario. However, if I look inside the
PML's at their selection logic, I find that a proc can ONLY pick a module
other than ob1 if:

1. the user specifies the module to use via -mca pml xyz or by using a
module specific mca param to adjust its priority. In this case, since the
mca param is propagated, ALL procs have no choice but to pick that same
module, so that can't cause us to abort (we will have already returned an
error and aborted if the specified module can't run).

2. the pml/cm module detects that an MTL module was selected, and that it is
other than "psm". In this case, the CM module will be selected because its
default priority is higher than that of OB1.

In looking deeper into the MTL selection logic, it appears to me that you
either have the required capability or you don't. I can see that in some
environments (e.g., rsh across unmanaged collections of machines), it might
be possible for someone to launch across a set of machines where some do and
some don't have the required support. However, in all other cases, this will
be homogeneous across the system.

Given this analysis (and someone more familiar with the PML should feel free
to confirm or correct it), it seems to me that this could be streamlined via
one or more means:

1. at the most, we could have rank=0 add the PML module name to the modex,
and other procs simply check it against their own and return an error if
they differ. This accomplishes the identical functionality to what we have
today, but with much less info in the modex.

2. we could eliminate this info from the modex altogether by requiring the
user to specify the PML module if they want something other than the default
OB1. In this case, there can be no confusion over what each proc is to use.
The CM module will attempt to init the MTL - if it cannot do so, then the
job will return the correct error and tell the user that CM/MTL support is
unavailable.

3. we could again eliminate the info by not inserting it into the modex if
(a) the default PML module is selected, or (b) the user specified the PML
module to be used. In the first case, each proc can simply check to see if
they picked the default - if not, then we can insert the info to indicate
the difference. Thus, in the "standard" case, no info will be inserted.

In the second case, we will already get an error if the specified PML module
could not be used. Hence, the modex check provides no additional info or
value.

I understand the motivation to support automation. However, in this case,
the automation actually doesn't seem to buy us very much, and it isn't
coming "free". So perhaps some change in how this is done would be in order?

Ralph








[OMPI devel] PML selection logic

2008-06-23 Thread Ralph H Castain
Yo all

I've been doing further research into the modex and came across something I
don't fully understand. It seems we have each process insert into the modex
the name of the PML module that it selected. Once the modex has exchanged
that info, it then loops across all procs in the job to check their
selection, and aborts if any proc picked a different PML module.
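
To make the check described above concrete, here is a rough Python model of
it. This is only a sketch, not the actual Open MPI code: the modex is
modelled as a plain dict of rank -> published PML name, and the function
names run_modex and check_all_peers are invented for illustration.

```python
# Hypothetical model of the current behaviour: every proc publishes the
# name of the PML it selected, then every proc walks the full list.

def run_modex(selected_by_rank):
    """Stand-in for the exchange: every proc ends up with every entry."""
    return dict(selected_by_rank)


def check_all_peers(my_rank, modex):
    """Loop across all procs in the job; fail if any picked a different PML."""
    mine = modex[my_rank]
    for rank, name in sorted(modex.items()):
        if name != mine:
            # In the real library this is where the job would abort.
            raise RuntimeError(
                f"rank {rank} selected PML '{name}', "
                f"but rank {my_rank} selected '{mine}'"
            )
    return mine
```

Note that in this model the modex carries one entry per proc, and each proc
performs one comparison per proc in the job, even when every entry is the
same.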

All well and good...assuming that procs actually -can- choose different PML
modules and hence create an "abort" scenario. However, if I look inside the
PMLs at their selection logic, I find that a proc can ONLY pick a module
other than ob1 if:

1. the user specifies the module to use via -mca pml xyz or by using a
module specific mca param to adjust its priority. In this case, since the
mca param is propagated, ALL procs have no choice but to pick that same
module, so that can't cause us to abort (we will have already returned an
error and aborted if the specified module can't run).

2. the pml/cm module detects that an MTL module was selected, and that it is
other than "psm". In this case, the CM module will be selected because its
default priority is higher than that of OB1.
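
The two cases above can be sketched as a small priority-based selection
routine. This is a simplified model under stated assumptions: the numeric
priorities are made up (only their ordering, CM above OB1, comes from the
description above), the function select_pml and its arguments are
hypothetical, and the "adjust a module's priority via its own mca param"
variant of case 1 is left out.

```python
# Illustrative priorities only; the real defaults are not given here.
OB1_PRIORITY = 20
CM_PRIORITY = 30


def select_pml(user_pml=None, mtl=None):
    """Return the PML a proc would pick in this simplified model.

    user_pml: value of a user-supplied "-mca pml <name>", or None.
    mtl:      name of the MTL module that was selected, or None if none was.
    """
    # Case 1: the user named the module, so every proc must use it or fail.
    if user_pml is not None:
        if user_pml == "cm" and mtl is None:
            raise RuntimeError("CM/MTL support is unavailable")
        if user_pml not in ("ob1", "cm"):
            raise RuntimeError(f"unknown PML '{user_pml}' in this sketch")
        return user_pml

    # Case 2: CM only becomes a candidate if an MTL other than "psm" was
    # selected; it then wins because its priority is higher than OB1's.
    candidates = {"ob1": OB1_PRIORITY}
    if mtl is not None and mtl != "psm":
        candidates["cm"] = CM_PRIORITY
    return max(candidates, key=candidates.get)
```

With no user setting, the outcome in this model depends only on which MTL
(if any) was selected on the local node, which is why heterogeneous MTL
support is the only way procs could disagree.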

In looking deeper into the MTL selection logic, it appears to me that you
either have the required capability or you don't. I can see that in some
environments (e.g., rsh across unmanaged collections of machines), it might
be possible for someone to launch across a set of machines where some do and
some don't have the required support. However, in all other cases, this will
be homogeneous across the system.

Given this analysis (and someone more familiar with the PML should feel free
to confirm or correct it), it seems to me that this could be streamlined via
one or more means:

1. at the most, we could have rank=0 add the PML module name to the modex,
and other procs simply check it against their own and return an error if
they differ. This accomplishes the identical functionality to what we have
today, but with much less info in the modex.

2. we could eliminate this info from the modex altogether by requiring the
user to specify the PML module if they want something other than the default
OB1. In this case, there can be no confusion over what each proc is to use.
The CM module will attempt to init the MTL - if it cannot do so, then the
job will return the correct error and tell the user that CM/MTL support is
unavailable.

3. we could again eliminate the info by not inserting it into the modex if
(a) the default PML module is selected, or (b) the user specified the PML
module to be used. In the first case, each proc can simply check to see if
they picked the default - if not, then we can insert the info to indicate
the difference. Thus, in the "standard" case, no info will be inserted.

In the second case, we will already get an error if the specified PML module
could not be used. Hence, the modex check provides no additional info or
value.
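
Options 1 and 3 above can be sketched roughly as follows. Again this is a
toy model rather than Open MPI code: the modex is a dict of rank -> PML
name, and the helper names (publish_rank0_only, check_against_rank0,
entry_for_modex) are invented for illustration.

```python
DEFAULT_PML = "ob1"


def publish_rank0_only(rank, selected):
    """Option 1: only rank 0 contributes its PML name to the modex."""
    return {0: selected} if rank == 0 else {}


def check_against_rank0(selected, modex):
    """Option 1: every other proc compares its own choice with rank 0's."""
    if modex[0] != selected:
        raise RuntimeError(
            f"selected PML '{selected}' differs from rank 0's '{modex[0]}'"
        )
    return selected


def entry_for_modex(selected, user_specified):
    """Option 3: return the name to publish, or None if nothing is needed.

    Nothing is published when the user named the PML (a failure to use it
    has already produced an error) or when the default was picked.
    """
    if user_specified or selected == DEFAULT_PML:
        return None
    return selected
```

In the "standard" case (default PML everywhere, or a user-specified PML),
entry_for_modex returns None on every proc, so the modex carries no PML
information at all; under option 1 it carries exactly one entry regardless
of job size.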

I understand the motivation to support automation. However, in this case,
the automation actually doesn't seem to buy us very much, and it isn't
coming "free". So perhaps some change in how this is done would be in order?

Ralph