Re: [OMPI devel] Multi-Rail and Open IB BTL

2007-11-14 Thread Gleb Natapov
Sorry, I missed the mail with the question.

On Mon, Nov 12, 2007 at 06:03:07AM -0500, Jeff Squyres wrote:
> On Nov 9, 2007, at 1:24 PM, Don Kerr wrote:
> 
> > both, I was thinking of listing what I think are multi-rail  
> > requirements
> > but wanted to understand what the current state of things are
> 
> I believe the OF portion of the FAQ describes what we do in the v1.2  
> series (right Gleb?); I honestly don't remember what we do today on  
> the trunk (I'm pretty sure that Gleb has tweaked it recently).
I haven't tweaked anything related to this recently. If one host has two
ports and another has one port, only one connection is established
between them.

--
Gleb.


Re: [OMPI devel] initial SCTP BTL commit comments?

2007-11-14 Thread Terry Dontje

Brad Penoff wrote:
> On Nov 12, 2007 3:26 AM, Jeff Squyres wrote:
>
>> I have no objections to bringing this into the trunk, but I agree that
>> an .ompi_ignore is probably a good idea at first.
>
> I'll try to cook up a commit soon then!
>
>> One question that I'd like to have answered is how OMPI decides
>> whether to use the SCTP BTL or not.  If there are SCTP stacks
>> available by default in Linux and OS X -- but their performance may be
>> sub-optimal and/or buggy, we may want to have the SCTP BTL only
>> activated if the user explicitly asks for it.  Open MPI is very
>> concerned with "out of the box" behavior -- we need to ensure that
>> "mpirun a.out" will "just work" on all of our supported platforms.
>
> Just to make a few things explicit...
>
> Things would only work out of the box on FreeBSD, and there the stack
> is very good.
>
> We have less experience with the Linux stack but hope the availability
> of an SCTP BTL will help encourage its use by us and others.  Now it
> is a module by default (loaded with "modprobe sctp") but the actual
> SCTP sockets extension API needs to be downloaded and installed
> separately.  The so-called lksctp-tools can be obtained here:
> http://sourceforge.net/project/showfiles.php?group_id=26529
>
> The OS X stack does not come by default but instead is a kernel extension:
> http://sctp.fh-muenster.de/sctp-nke.html
> I haven't yet started this testing but intend to soon.  As of now
> though, the supplied configure.m4 does not try to even build the
> component on Mac OS X.
>
> So in my opinion, things in the configure scripts should be fine the
> way they are since only the FreeBSD stack (which we have confidence in)
> will try to work out of the box; the others require the user to
> install things.
  
I am gathering from the text above you haven't tried your BTL on Solaris 
at all.


--td


Re: [OMPI devel] [OMPI svn] svn:open-mpi r16723

2007-11-14 Thread Tim Prins

Hi,

The following files bother me about this commit:
trunk/ompi/mca/btl/sctp/sctp_writev.c
trunk/ompi/mca/btl/sctp/sctp_writev.h

They bother me for 2 reasons:
1. Their naming does not follow the prefix rule
2. They are LGPL licensed. While I personally like the LGPL, I do not 
believe it is compatible with the BSD license that OMPI is distributed 
under. I think (though I could be wrong) that these files need to be 
removed from the repository and the functionality implemented in some 
other way.


Tim


pen...@osl.iu.edu wrote:

Author: penoff
Date: 2007-11-13 18:39:16 EST (Tue, 13 Nov 2007)
New Revision: 16723
URL: https://svn.open-mpi.org/trac/ompi/changeset/16723

Log:
initial SCTP BTL commit
Added:
   trunk/ompi/mca/btl/sctp/
   trunk/ompi/mca/btl/sctp/.ompi_ignore
   trunk/ompi/mca/btl/sctp/.ompi_unignore
   trunk/ompi/mca/btl/sctp/Makefile.am
   trunk/ompi/mca/btl/sctp/btl_sctp.c
   trunk/ompi/mca/btl/sctp/btl_sctp.h
   trunk/ompi/mca/btl/sctp/btl_sctp_addr.h
   trunk/ompi/mca/btl/sctp/btl_sctp_component.c
   trunk/ompi/mca/btl/sctp/btl_sctp_component.h
   trunk/ompi/mca/btl/sctp/btl_sctp_endpoint.c
   trunk/ompi/mca/btl/sctp/btl_sctp_endpoint.h
   trunk/ompi/mca/btl/sctp/btl_sctp_frag.c
   trunk/ompi/mca/btl/sctp/btl_sctp_frag.h
   trunk/ompi/mca/btl/sctp/btl_sctp_hdr.h
   trunk/ompi/mca/btl/sctp/btl_sctp_proc.c
   trunk/ompi/mca/btl/sctp/btl_sctp_proc.h
   trunk/ompi/mca/btl/sctp/btl_sctp_recv_handler.c
   trunk/ompi/mca/btl/sctp/btl_sctp_recv_handler.h
   trunk/ompi/mca/btl/sctp/btl_sctp_utils.c
   trunk/ompi/mca/btl/sctp/btl_sctp_utils.h
   trunk/ompi/mca/btl/sctp/configure.m4
   trunk/ompi/mca/btl/sctp/configure.params
   trunk/ompi/mca/btl/sctp/sctp_writev.c
   trunk/ompi/mca/btl/sctp/sctp_writev.h


Diff not shown due to size (201438 bytes).
To see the diff, run the following command:

svn diff -r 16722:16723 --no-diff-deleted

___
svn mailing list
s...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn


Re: [OMPI devel] [OMPI svn] svn:open-mpi r16723

2007-11-14 Thread Gleb Natapov
On Wed, Nov 14, 2007 at 06:44:06AM -0800, Tim Prins wrote:
> Hi,
> 
> The following files bother me about this commit:
>  trunk/ompi/mca/btl/sctp/sctp_writev.c
>  trunk/ompi/mca/btl/sctp/sctp_writev.h
> 
> They bother me for 2 reasons:
> 1. Their naming does not follow the prefix rule
> 2. They are LGPL licensed. While I personally like the LGPL, I do not 
> believe it is compatible with the BSD license that OMPI is distributed 
> under. I think (though I could be wrong) that these files need to be 
> removed from the repository and the functionality implemented in some 
> other way.

Can a function that fills a couple of struct fields be reimplemented in
any other way? :)

> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] [OMPI svn] svn:open-mpi r16723

2007-11-14 Thread Jeff Squyres (jsquyres)
Tim - excellent catch!

I totally agree.  We must be very mindful of IP-related issues.

-jms
Sent from my PDA

-----Original Message-----
From: Tim Prins [mailto:tpr...@cs.indiana.edu]
Sent: Wednesday, November 14, 2007 09:44 AM Eastern Standard Time
To: de...@open-mpi.org
Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r16723

Hi,

The following files bother me about this commit:
 trunk/ompi/mca/btl/sctp/sctp_writev.c
 trunk/ompi/mca/btl/sctp/sctp_writev.h

They bother me for 2 reasons:
1. Their naming does not follow the prefix rule
2. They are LGPL licensed. While I personally like the LGPL, I do not 
believe it is compatible with the BSD license that OMPI is distributed 
under. I think (though I could be wrong) that these files need to be 
removed from the repository and the functionality implemented in some 
other way.

Tim




Re: [OMPI devel] initial SCTP BTL commit comments?

2007-11-14 Thread Brad Penoff
On Nov 14, 2007 5:11 AM, Terry Dontje  wrote:
>
> Brad Penoff wrote:
> > On Nov 12, 2007 3:26 AM, Jeff Squyres  wrote:
> >
> >> I have no objections to bringing this into the trunk, but I agree that
> >> an .ompi_ignore is probably a good idea at first.
> >>
> >
> > I'll try to cook up a commit soon then!
> >
> >
> >> One question that I'd like to have answered is how OMPI decides
> >> whether to use the SCTP BTL or not.  If there are SCTP stacks
> >> available by default in Linux and OS X -- but their performance may be
> >> sub-optimal and/or buggy, we may want to have the SCTP BTL only
> >> activated if the user explicitly asks for it.  Open MPI is very
> >> concerned with "out of the box" behavior -- we need to ensure that
> >> "mpirun a.out" will "just work" on all of our supported platforms.
> >>
> >
> > Just to make a few things explicit...
> >
> > Things would only work out of the box on FreeBSD, and there the stack
> > is very good.
> >
> > We have less experience with the Linux stack but hope the availability
> > of an SCTP BTL will help encourage its use by us and others.  Now it
> > is a module by default (loaded with "modprobe sctp") but the actual
> > SCTP sockets extension API needs to be downloaded and installed
> > separately.  The so-called lksctp-tools can be obtained here:
> > http://sourceforge.net/project/showfiles.php?group_id=26529
> >
> > The OS X stack does not come by default but instead is a kernel extension:
> > http://sctp.fh-muenster.de/sctp-nke.html
> > I haven't yet started this testing but intend to soon.  As of now
> > though, the supplied configure.m4 does not try to even build the
> > component on Mac OS X.
> >
> > So in my opinion, things in the configure scripts should be fine the
> > way they are since only the FreeBSD stack (which we have confidence in)
> > will try to work out of the box; the others require the user to
> > install things.
> >

Greetings,

> I am gathering from the text above you haven't tried your BTL on Solaris
> at all.

The short answer is that you are correct: we haven't tried the Open MPI
SCTP BTL on Solaris yet.  In fact, the configure.m4 file checks the
$host value and only tries to build if it's on Linux or a BSD variant.
Mac OS X uses the same code as BSD, but I have only just got my hands
on a machine, so even it hasn't been tested yet; Solaris remains on the
TODO list.
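
To illustrate the kind of $host gate described above, here is a rough sketch in Python.  It is not the contents of configure.m4 (which is m4/shell and not shown in this thread); the function name and the glob patterns are assumptions made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns: the thread only says "Linux or a BSD variant".
BUILD_PATTERNS = ("*linux*", "*bsd*")

def should_build_sctp_btl(host_triplet):
    """Return True if the host triplet looks like Linux or a BSD variant."""
    host = host_triplet.lower()
    return any(fnmatchcase(host, pattern) for pattern in BUILD_PATTERNS)

for triplet in ("x86_64-unknown-linux-gnu",
                "i386-unknown-freebsd6.2",
                "sparc-sun-solaris2.10"):
    print(triplet, should_build_sctp_btl(triplet))
```

With a gate like this, a Solaris or Mac OS X triplet simply falls through and the component is not built, which matches the behaviour described in the message.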

However, there's a slightly longer answer...

After a series of emails with the Sun SCTP people
(sctp-questi...@sun.com but mostly Kacheong Poon) a year ago, I
learned that SCTP support is included in Solaris 10 by default.  In
general, SCTP supports its own socket API, in addition to the standard
Berkeley sockets API; the SCTP-specific sockets API unlocks some of
SCTP's newer features (e.g., multistreaming).  We make use of this
SCTP-specific sockets API.

The Solaris stack (as of a year ago) made certain assumptions about
the SCTP-specific sockets API.  I'm just looking back on those emails
now to refresh my memory... it looks like the Solaris stack, as of
Nov 2006, did not allow the use of one-to-many sockets (the current
default in our BTL) together with the sctp_sendmsg call.  They
mentioned an alternative, but we didn't have the time to explore it.
I'm not sure if this has changed on the Solaris stack within the past
year... I never got the time to revisit this.

In the past, we had mostly used the one-to-many socket (with our LAM
and MPICH2 versions).  One unique thing about this Open MPI SCTP BTL
is that there is also a choice to make use of (the more TCP-like)
one-to-one socket style.  The socket style used by the SCTP BTL is
adjustable with the MCA parameter btl_sctp_if_11 (if set to 1, it uses
1-1 sockets; by default it is 0 and uses 1-many).  I've never used
one-to-one sockets on the Solaris stack, but it may have a better
chance of working (also one-to-many may work now; I haven't kept
up-to-date).
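
As a small illustration of the two socket styles, here is a sketch in Python.  The 0/1 meaning of btl_sctp_if_11 is taken from the paragraph above; the helper names are hypothetical and this is not code from the BTL.  In the SCTP sockets API, one-to-one sockets are created with SOCK_STREAM and one-to-many sockets with SOCK_SEQPACKET.

```python
import socket

def sctp_socket_type(if_11):
    """Map the btl_sctp_if_11 value to the matching SCTP socket type.

    1 selects the one-to-one (TCP-like) style, anything else the
    one-to-many style, mirroring the default of 0 described above.
    """
    return socket.SOCK_STREAM if if_11 == 1 else socket.SOCK_SEQPACKET

def mpirun_args(if_11, nprocs, program):
    """Build an mpirun command line that sets the MCA parameter explicitly."""
    return ["mpirun", "--mca", "btl_sctp_if_11", str(if_11),
            "-np", str(nprocs), program]

print(" ".join(mpirun_args(1, 2, "./a.out")))
```

Actually opening such a socket (with protocol IPPROTO_SCTP) needs a kernel with SCTP support, so the sketch stops at choosing the type.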

We also noticed that on Solaris we had to do some things a little
differently with iovecs because the struct msghdr (used by sendmsg) had
no msg_control field; to get around this, we had to pack the iovec's
contents into a buffer and send that buffer instead of using the iovec
directly.
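
The packing workaround can be sketched like this in Python.  It only shows the general idea of copying scattered fragments into one contiguous buffer before a single send; the function name and the sample fragments are invented for the example, and the real BTL code is C.

```python
def pack_iov(fragments):
    """Copy a list of bytes-like fragments into one contiguous buffer."""
    total = sum(len(frag) for frag in fragments)
    buf = bytearray(total)
    offset = 0
    for frag in fragments:
        # Place each fragment directly after the previous one.
        buf[offset:offset + len(frag)] = frag
        offset += len(frag)
    return bytes(buf)

# Hypothetical header and payload standing in for the iovec entries.
header = b"\x01\x02"
payload = b"hello"
packed = pack_iov([header, payload])
print(len(packed))
```

The cost of this approach is an extra copy per send, which is the price of not handing the iovec to the stack directly.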

Anyway, hope this fully answers your questions.  In general, it'd be
nice if we have the time/assistance to add in Solaris support
eventually.

brad



Re: [OMPI devel] Multi-Rail and Open IB BTL

2007-11-14 Thread Don Kerr



Jeff Squyres wrote:
> On Nov 9, 2007, at 1:24 PM, Don Kerr wrote:
>
>> both, I was thinking of listing what I think are multi-rail
>> requirements
>> but wanted to understand what the current state of things are
>
> I believe the OF portion of the FAQ describes what we do in the v1.2
> series (right Gleb?); I honestly don't remember what we do today on
> the trunk (I'm pretty sure that Gleb has tweaked it recently).

Gleb's response answered this.

> As for what we *should* do, it's a very complicated question.  :-\


OK. I knew the "close to NIC" was a concern but was not aware an attempt 
to tackle this began. I will look at the "carto" framework.


Thanks
-DON

> This is where all these discussions regarding affinity, NUMA, and NUNA
> (non uniform network architecture) come into play.  A "very simple"
> scenario may be something like this:
>
> - host A is UMA (perhaps even a uniprocessor) with 2 ports that are
>   equidistant from the 1 MPI process on that host
> - host B is the same, except it only has 1 active port on the same IB
>   subnet as host A's 2 ports
> - the ports on both hosts are all the same speed (e.g., DDR)
> - the ports all share a single, common, non-blocking switch
>
> But even with this "simple" case, the answer as to what you should do
> is still unclear.  If host A is able to drive both of its DDR links at
> full speed, you could cause congestion at the link to host B if the
> MPI process on host A opens two connections.  But if host A is only
> able to drive the same effective bandwidth out of its two ports as it
> is through a single port, then the end effect is probably fairly
> negligible -- it might not make much of a difference at all as to
> whether the MPI process on host A opens 1 or 2 connections to host B.
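
The arithmetic behind the "simple" scenario can be sketched as follows.  This is a toy model under stated assumptions, not anything Open MPI computes: rates are normalized so one link carries 1.0, each connection uses its own port on host A, and the function and parameter names are invented for the example.

```python
def offered_and_excess(sender_capacity, link_rate, connections, receiver_ports=1):
    """Return (offered load, excess load) at the receiver's link(s).

    sender_capacity: total rate host A can actually drive, over all ports.
    link_rate:       rate of a single port/link (all ports are equal).
    connections:     connections opened by A, one per local port.
    """
    # A cannot send faster than it can drive, nor faster than its open links.
    offered = min(sender_capacity, connections * link_rate)
    # Host B can absorb at most what its active port(s) carry.
    deliverable = receiver_ports * link_rate
    excess = max(0.0, offered - deliverable)
    return offered, excess

# Host A can saturate both links: two connections oversubscribe B's link.
print(offered_and_excess(2.0, 1.0, connections=2))
# Host A can only drive one link's worth in total: 1 or 2 makes no difference.
print(offered_and_excess(1.0, 1.0, connections=2))
print(offered_and_excess(1.0, 1.0, connections=1))
```

The model ignores everything that makes the real question hard (NUMA placement, other jobs on the fabric, switch behaviour), so it only restates the two cases in the paragraph above in numbers.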


> But then throw in other effects that I mentioned above (NUMA, NUNA,
> etc.), and the equation becomes much more complex.  In some cases, it
> may be good to open 1 connection (e.g., bandwidth load balancing); in
> other cases it may be good to open 2 (e.g., congestion avoidance /
> spreading traffic around the network, particularly in the presence of
> other MPI jobs on the network).  :-\
>
> Such NUNA architectures may sound unusual to some, but both IBM and HP
> sell [many] blade-based HPC solutions with NUNA internal IB networks.
> Specifically: this is a fairly common scenario.
>
> So this is a difficult question without a great answer.  The hope is
> that the new carto framework that Sharon sent requirements around for
> will be able to at least make topology information available from both
> the host and the network so that BTLs can possibly make some
> intelligent decisions about what to do in these kinds of scenarios.