Re: [OMPI devel] Multi-Rail and Open IB BTL

2007-11-13 Thread Jeff Squyres

On Nov 9, 2007, at 1:24 PM, Don Kerr wrote:

> Both; I was thinking of listing what I think are the multi-rail
> requirements, but wanted to understand what the current state of
> things is.


I believe the OF portion of the FAQ describes what we do in the v1.2  
series (right Gleb?); I honestly don't remember what we do today on  
the trunk (I'm pretty sure that Gleb has tweaked it recently).


As for what we *should* do, it's a very complicated question.  :-\

This is where all these discussions regarding affinity, NUMA, and NUNA  
(non uniform network architecture) come into play.  A "very simple"  
scenario may be something like this:


- host A is UMA (perhaps even a uniprocessor) with 2 ports that are  
equidistant from the 1 MPI process on that host
- host B is the same, except it only has 1 active port on the same IB  
subnet as host A's 2 ports

- the ports on both hosts are all the same speed (e.g., DDR)
- the ports all share a single, common, non-blocking switch

But even with this "simple" case, the answer as to what you should do  
is still unclear.  If host A is able to drive both of its DDR links at  
full speed, you could cause congestion at the link to host B if the  
MPI process on host A opens two connections.  But if host A is only  
able to drive the same effective bandwidth out of its two ports as it  
is through a single port, then the end effect is probably fairly  
negligible -- it might not make much of a difference at all as to  
whether the MPI process on host A opens 1 or 2 connections to host B.
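
To make the two cases above concrete, here is a toy model of the scenario. It is only a sketch: the function name, the normalized link rate of 1.0, and the "host limit" parameter are assumptions made for illustration, not measurements or anything taken from Open MPI.

```python
# Toy model of the two-host scenario; all numbers are illustrative assumptions.

def offered_vs_capacity(sender_ports, sender_can_drive, link_rate, receiver_ports):
    """Return (offered, capacity, oversubscribed), in the units of link_rate.

    sender_can_drive is the total rate the sending host can actually push,
    summed over all of its ports (a host-side limit such as bus bandwidth).
    """
    # The sender cannot exceed its host limit or its aggregate link rate.
    offered = min(sender_can_drive, sender_ports * link_rate)
    # The receiver can absorb at most its aggregate link rate.
    capacity = receiver_ports * link_rate
    return offered, capacity, offered > capacity


# Case 1: host A can drive both links at full speed -> host B's link is the bottleneck.
print(offered_vs_capacity(2, 2.0, 1.0, 1))

# Case 2: host A can only drive one link's worth in total -> no oversubscription.
print(offered_vs_capacity(2, 1.0, 1.0, 1))
```

The model ignores the switch (assumed non-blocking, as in the scenario) and says nothing about NUMA or NUNA effects.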


But then throw in other effects that I mentioned above (NUMA, NUNA,  
etc.), and the equation becomes much more complex.  In some cases, it  
may be good to open 1 connection (e.g., bandwidth load balancing); in  
other cases it may be good to open 2 (e.g., congestion avoidance /  
spreading traffic around the network, particularly in the presence of  
other MPI jobs on the network).  :-\


Such NUNA architectures may sound unusual to some, but both IBM and HP  
sell [many] blade-based HPC solutions with NUNA internal IB networks.   
Specifically: this is a fairly common scenario.


So this is a difficult question without a great answer.  The hope is  
that the new carto framework that Sharon sent requirements around for  
will be able to at least make topology information available from both  
the host and the network so that BTLs can possibly make some  
intelligent decisions about what to do in these kinds of scenarios.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] initial SCTP BTL commit comments?

2007-11-13 Thread Jeff Squyres
I have no objections to bringing this into the trunk, but I agree that  
an .ompi_ignore is probably a good idea at first.


One question that I'd like to have answered is how OMPI decides  
whether to use the SCTP BTL or not.  If there are SCTP stacks  
available by default in Linux and OS X -- but their performance may be  
sub-optimal and/or buggy, we may want to have the SCTP BTL only  
activated if the user explicitly asks for it.  Open MPI is very  
concerned with "out of the box" behavior -- we need to ensure that  
"mpirun a.out" will "just work" on all of our supported platforms.


Will UBC set up regular MTT runs to test the SCTP stuff?  :-)

More below.


On Nov 10, 2007, at 9:25 PM, Brad Penoff wrote:


>>> Currently, both the one-to-one and the one-to-many make use of the
>>> event library offered by Open MPI.  The callback functions for the
>>> one-to-many style however are quite unique as multiple endpoints may
>>> be interested in the events that poll returns.  Currently we use
>>> these unique callback functions, but in the future the hope is to
>>> play with the potential benefits of a btl_progress function,
>>> particularly for the one-to-many style.
>>
>> In my experience the event callbacks have a high overhead compared
>> to a progress function, so I'd say that's definitely worth checking
>> out.
>
> We noticed that poll is only called after a timer goes off while
> btl_progress would be called with each iteration of opal_progress, so
> noticing that along with your encouragement makes us want to check it
> out even more.



Be aware that based on discussions from the Paris meeting, some  
changes to libevent are coming (I really need to get this on a wiki  
page or something).  Here's a quick summary:


- We're waiting for a new release of libevent (or libev -- we'll see  
how it shakes out) that has lots of bug fixes and performance  
improvements as compared to the version we currently have in the OMPI  
tree.  Based on some libevent mailing list traffic, this release may  
be in Dec 2007.  We'll see what happens.


- After we update libevent, we'll be making a policy change w.r.t.  
OMPI progress functions and timer callbacks: only software layers with  
actual devices will be allowed to register progress functions (in  
particular, the io and osd framework progress functions will be  
eliminated; see below).  All other progress-requiring functions will  
have to use timers.  This means that every time we call progress, we  
*only* call the stuff that needs to be polled as frequently as  
possible.  We'll call the less-important progress stuff less  
frequently (e.g., ORTE OOB/RML).


- We'll be changing our use of libevent to utilize the more scalable  
polling capabilities (such as epoll and friends).  We don't use them  
right now because on all OSes that we currently care about (Linux, OS  
X, Solaris), mixing the scalable fd polling mechanism with pty's  
results in Very Very Bad Things.  We'll special-case where pty's are  
used and only use select/poll there, and then use epoll (etc.)  
elsewhere.


- We'll also be changing our use of libevent to utilize timers  
properly.


- ompi_request_t will be augmented to have a callback that, if  
non-NULL, will be invoked when the request is completed.  This will  
allow removing the io and osd framework progress functions.


- We may also add a high-performance clock framework in Open MPI -- a  
way of accessing high-resolution timers and clocks on the host (e.g.,  
on Intel chips, additional algorithms are necessary to normalize the  
per-chip clocks between sockets, especially if a process bounces  
between sockets -- unnecessary on AMD, PPC, and SPARC platforms).   
This could improve performance and precision of the libevent timers.


- Finally, registering progress functions will take a new parameter: a  
file descriptor.  If a file descriptor is provided and opal_progress()  
decides that it wants to block (specific mechanism TBD, but probably  
something similar to what other hybrid polling/blocking systems do:  
poll for a while, and if nothing "interesting" happens, block) *and*  
if all registered progress functions have valid fd's, then we'll block  
until either a timer expires or something "interesting" happens.
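
The last bullet (poll for a while, then block if every registered progress function supplied an fd) can be sketched roughly as follows. This is only an illustration of the idea: ProgressEngine, register(), add_timer() and spin_limit are hypothetical names, not the opal_progress() API, and the real mechanism is still TBD.

```python
import os
import select
import time


class ProgressEngine:
    """Hypothetical stand-in for opal_progress(); not Open MPI's actual API."""

    def __init__(self, spin_limit=100):
        self.callbacks = []      # list of (progress_fn, fd or None)
        self.timers = []         # list of (deadline, timer_fn)
        self.spin_limit = spin_limit
        self.idle_spins = 0

    def register(self, progress_fn, fd=None):
        self.callbacks.append((progress_fn, fd))

    def add_timer(self, delay, timer_fn):
        self.timers.append((time.monotonic() + delay, timer_fn))

    def _run_expired_timers(self):
        now = time.monotonic()
        expired = [t for t in self.timers if t[0] <= now]
        self.timers = [t for t in self.timers if t[0] > now]
        for _, fn in expired:
            fn()
        return len(expired)

    def progress(self):
        """Run one iteration; return the number of events handled."""
        events = sum(fn() for fn, _ in self.callbacks)
        events += self._run_expired_timers()
        if events:
            self.idle_spins = 0
            return events
        self.idle_spins += 1
        fds = [fd for _, fd in self.callbacks]
        can_block = bool(fds) and all(fd is not None for fd in fds)
        if self.idle_spins >= self.spin_limit and can_block:
            # Nothing "interesting" for a while: block until an fd is
            # readable or the nearest timer is due.
            timeout = None
            if self.timers:
                nearest = min(d for d, _ in self.timers)
                timeout = max(0.0, nearest - time.monotonic())
            select.select(fds, [], [], timeout)
            self.idle_spins = 0
        return 0


# Demo: a pipe plays the role of a "device" with a pollable fd.
r, w = os.pipe()
received, fired = [], []


def pipe_progress():
    ready, _, _ = select.select([r], [], [], 0)
    if ready:
        received.append(os.read(r, 1))
        return 1
    return 0


engine = ProgressEngine(spin_limit=3)
engine.register(pipe_progress, fd=r)
engine.add_timer(0.05, lambda: fired.append(True))
while not fired:                 # spins briefly, then blocks until the timer
    engine.progress()
os.write(w, b"x")
handled = engine.progress()      # the pipe is readable now
```

If any registered progress function has no fd, the sketch never blocks and simply keeps polling, which matches the "*and* if all registered progress functions have valid fd's" condition above.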


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] initial SCTP BTL commit comments?

2007-11-13 Thread Brad Penoff
On Nov 12, 2007 3:26 AM, Jeff Squyres wrote:
> I have no objections to bringing this into the trunk, but I agree that
> an .ompi_ignore is probably a good idea at first.

I'll try to cook up a commit soon then!

> One question that I'd like to have answered is how OMPI decides
> whether to use the SCTP BTL or not.  If there are SCTP stacks
> available by default in Linux and OS X -- but their performance may be
> sub-optimal and/or buggy, we may want to have the SCTP BTL only
> activated if the user explicitly asks for it.  Open MPI is very
> concerned with "out of the box" behavior -- we need to ensure that
> "mpirun a.out" will "just work" on all of our supported platforms.

Just to make a few things explicit...

Things would only work out of the box on FreeBSD, and there the stack
is very good.

We have less experience with the Linux stack but hope the availability
of an SCTP BTL will help encourage its use by us and others.  Now it
is a module by default (loaded with "modprobe sctp") but the actual
SCTP sockets extension API needs to be downloaded and installed
separately.  The so-called lksctp-tools can be obtained here:
http://sourceforge.net/project/showfiles.php?group_id=26529
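
As a quick way to see whether a given machine has a usable kernel SCTP stack at all, something like the following probe could be used. It is a sketch with assumptions: sctp_available() is a made-up helper name, and it only checks that the kernel will create an SCTP socket; it does not check that the lksctp-tools sockets extension API is installed.

```python
import socket

# 132 is the IANA protocol number for SCTP; it is used here only as a
# fallback when this Python build does not define socket.IPPROTO_SCTP.
IPPROTO_SCTP = getattr(socket, "IPPROTO_SCTP", 132)


def sctp_available():
    """Return True if the kernel lets us create a one-to-one style SCTP socket."""
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_SCTP)
    except OSError:
        # Typically "protocol not supported" when no SCTP stack is present.
        return False
    sock.close()
    return True


print("SCTP sockets available:", sctp_available())
```

SOCK_STREAM selects the one-to-one style; the one-to-many style would use SOCK_SEQPACKET instead.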

The OS X stack does not come by default but instead is a kernel extension:
http://sctp.fh-muenster.de/sctp-nke.html
I haven't yet started this testing but intend to soon.  As of now
though, the supplied configure.m4 does not try to even build the
component on Mac OS X.

So in my opinion, things in the configure scripts should be fine the
way they are, since only the FreeBSD stack (which we have confidence in)
will try to work out of the box; the others require the user to
install things.


A question I had was with respect to what to set for the default value
of btl_sctp_exclusivity... I had wanted the exclusivity to be
"slightly less than TCP" so it was available but not the default.  In
the code I set btl_sctp_exclusivity to this:
MCA_BTL_EXCLUSIVITY_LOW - 1
...however MCA_BTL_EXCLUSIVITY_LOW is defined as 0 and ompi_info says
that exclusivity must be >= 0... a -1 exclusivity doesn't seem to
break anything though...   If two BTLs have the same exclusivity, what
is the tie-break?  Alphabetic order?
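
To spell out the arithmetic behind the question, here is a small sketch. Only the value MCA_BTL_EXCLUSIVITY_LOW = 0 and the ">= 0" constraint come from what is described above; the TCP value, the helper names, and the ranking function are assumptions for illustration and are not how OMPI actually selects BTLs.

```python
MCA_BTL_EXCLUSIVITY_LOW = 0      # value as stated above


def is_valid_exclusivity(value):
    """Mirror the constraint reported by ompi_info: exclusivity must be >= 0."""
    return value >= 0


sctp_exclusivity = MCA_BTL_EXCLUSIVITY_LOW - 1
print(sctp_exclusivity, is_valid_exclusivity(sctp_exclusivity))


def rank_by_exclusivity(btls):
    """Order (name, exclusivity) pairs from highest to lowest exclusivity.

    Python's sort is stable, so equal values keep their input order.  That
    is a property of this sketch, not a statement about OMPI's tie-break.
    """
    return sorted(btls, key=lambda btl: btl[1], reverse=True)


# Assumed values: suppose both BTLs were given the lowest legal exclusivity.
print(rank_by_exclusivity([("tcp", 0), ("sctp", 0)]))
print(rank_by_exclusivity([("sctp", 0), ("tcp", 0)]))
```

With equal values the outcome here depends purely on input order, which is why the tie-break rule matters if SCTP cannot legally go below TCP.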

>
> Will UBC setup regular MTT runs to test the SCTP stuff?  :-)
>

We've only started playing with MTT so I'm sure we'll have plenty of
questions as we begin this process!


Re: [OMPI devel] initial SCTP BTL commit comments?

2007-11-13 Thread Brad Penoff
On Nov 13, 2007 12:41 PM, Brad Penoff wrote:
> On Nov 12, 2007 3:26 AM, Jeff Squyres  wrote:
> > I have no objections to bringing this into the trunk, but I agree that
> > an .ompi_ignore is probably a good idea at first.
>
> I'll try to cook up a commit soon then!

It's in there now!
https://svn.open-mpi.org/trac/ompi/changeset/16723

A quick sanity test shows that things are operational.  For others to
use it, they'll of course have to adjust .ompi_ignore (or
.ompi_unignore).

We're playing with MTT now so I'd expect we'll have some questions on
that front in the near future.

Where is the best place to put BTL-specific documentation (for
example, some setup tips and weblinks)?

brad

