Re: [OMPI devel] matching code rewrite in OB1

2007-12-14 Thread Gleb Natapov
On Thu, Dec 13, 2007 at 06:16:49PM -0500, Richard Graham wrote:
> The situation that needs to be triggered, just as George has mentioned, is
> where we have a lot of unexpected messages, to make sure that when one that
> we can match against comes in, all the unexpected messages that can be
> matched with pre-posted receives are matched.  Since we attempt to match
> only when a new fragment comes in, we need to make sure that we don't leave
> other unexpected messages that can be matched in the unexpected queue, as
> these (if the out of order scenario is just right) would block any new
> matches from occurring.
> 
> For example:  Say the next expected message is 25
> 
> Unexpected message queue has:  26 28 29 ..
> 
> If 25 comes in and is handled, but 26 is not pulled off the unexpected
> message queue, then when 27 comes in it won't be able to be matched, as 26 is
> sitting in the unexpected queue and will never be looked at again ...
This situation is triggered constantly with the openib BTL. The openib BTL has
two ways to receive a packet: over a send queue or over an eager RDMA path.
The receiver polls both of them and may reorder packets locally. There is
currently a bug in the openib BTL where one channel may starve the other at
the receiver, so if a match fragment with the next sequence number is in the
starved path, tens of thousands of fragments can be reordered. The test case
attached to ticket #1158 triggers this case, and my patch handles all
reordered packets.

And, by the way, the code is much simpler now and can be reviewed easily ;)

--
Gleb.
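[The drain loop Rich describes boils down to: deliver the in-order fragment, then repeatedly re-scan the cached out-of-order fragments until nothing more matches. A toy sketch of that invariant in C — illustrative only, not the actual OB1 code; all names here are made up:]

```c
#include <stdint.h>

#define MAX_CACHED 64

/* Toy model of in-order matching with an out-of-order fragment cache. */
typedef struct {
    uint16_t next_expected;          /* next sequence number we can match  */
    uint16_t cached[MAX_CACHED];     /* out-of-order fragments, unsorted   */
    int      ncached;
    uint16_t delivered[MAX_CACHED];  /* order in which fragments matched   */
    int      ndelivered;
} match_state_t;

static void deliver(match_state_t *s, uint16_t seq) {
    s->delivered[s->ndelivered++] = seq;
    s->next_expected = (uint16_t)(seq + 1);
}

/* Remove next_expected from the cache if present; return 1 on success. */
static int take_cached(match_state_t *s) {
    for (int i = 0; i < s->ncached; i++) {
        if (s->cached[i] == s->next_expected) {
            uint16_t seq = s->cached[i];
            s->cached[i] = s->cached[--s->ncached];  /* swap-remove */
            deliver(s, seq);
            return 1;
        }
    }
    return 0;
}

void frag_arrived(match_state_t *s, uint16_t seq) {
    if (seq != s->next_expected) {   /* out of order: stash for later */
        s->cached[s->ncached++] = seq;
        return;
    }
    deliver(s, seq);
    while (take_cached(s))           /* drain everything now matchable */
        ;
}
```

With 26, 28, 29 cached and 25 arriving, 25 and 26 are delivered; when 27 later arrives, 27, 28, and 29 all drain in one pass — exactly the property the example above demands.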


Re: [OMPI devel] matching code rewrite in OB1

2007-12-14 Thread Richard Graham
If you have positive confirmation that such things have happened, this will
go a long way.  I will not trust the code until this has also been done with
multiple independent network paths.  I very rarely express such strong
opinions, even if I don't agree with what is being done, but this is the
core of correct MPI functionality, and first-hand experience has shown that
when just thinking through the logic, I can miss some of the race conditions.
The code here has been running for 8+ years in two production MPIs running
on very large clusters, so I am very reluctant to make changes for what
seems to amount to people's taste - maintenance is not an issue in this
case.  Had this not been such a key bit of code, I would not even bat an
eye.  I suppose if you can go through some formal verification, that would
also be good - actually better than hoping that one will hit out-of-order
situations.

Rich


On 12/14/07 2:20 AM, "Gleb Natapov"  wrote:

> On Thu, Dec 13, 2007 at 06:16:49PM -0500, Richard Graham wrote:
>> The situation that needs to be triggered, just as George has mentioned, is
>> where we have a lot of unexpected messages, to make sure that when one that
>> we can match against comes in, all the unexpected messages that can be
>> matched with pre-posted receives are matched.  Since we attempt to match
>> only when a new fragment comes in, we need to make sure that we don't leave
>> other unexpected messages that can be matched in the unexpected queue, as
>> these (if the out of order scenario is just right) would block any new
>> matches from occurring.
>> 
>> For example:  Say the next expected message is 25
>> 
>> Unexpected message queue has:  26 28 29 ..
>> 
>> If 25 comes in and is handled, but 26 is not pulled off the unexpected
>> message queue, then when 27 comes in it won't be able to be matched, as 26 is
>> sitting in the unexpected queue and will never be looked at again ...
> This situation is triggered constantly with the openib BTL. The openib BTL has
> two ways to receive a packet: over a send queue or over an eager RDMA path.
> The receiver polls both of them and may reorder packets locally. There is
> currently a bug in the openib BTL where one channel may starve the other at
> the receiver, so if a match fragment with the next sequence number is in the
> starved path, tens of thousands of fragments can be reordered. The test case
> attached to ticket #1158 triggers this case, and my patch handles all
> reordered packets.
> 
> And, by the way, the code is much simpler now and can be reviewed easily ;)
> 
> --
> Gleb.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



[OMPI devel] IPv4 mapped IPv6 addresses

2007-12-14 Thread Adrian Knoth
Hi!

The current BTL/TCP and OOB/TCP code contains separate sockets for IPv4
and IPv6. Though it has never been a problem for me, this might cause an
out-of-FDs error in large clusters. (IIRC, rhc has already pointed out
this issue)

A possible way to reduce FD consumption would be to use IPv4-mapped
IPv6 addresses. These addresses let one use a single AF_INET6 socket for
both IPv4 and IPv6.

One year ago, I chose not to employ these addresses for two main
reasons:

   - Windows XP doesn't support them
   - OpenBSD has disabled them, but the system administrator can enable
 them at runtime

These limitations are also mentioned here: 

   http://en.wikipedia.org/wiki/IPv4_mapped_address#Limitations

Nowadays, Vista (and the Windows Server line) has support for
IPv4-mapped IPv6 addresses.

If disabled on OpenBSD systems, the code wouldn't be able to do IPv4,
but as already mentioned, the admin could easily fix this.

Should we consider moving towards these mapped addresses? The
implications:

   - less code, only one socket to handle
   - better FD consumption
   - breaks WinXP support, but not Vista/Longhorn or later
   - requires non-default kernel runtime setting on OpenBSD for IPv4
 connections

FWIW, FD consumption is the only real issue to consider.
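[A minimal sketch of what the single-socket approach looks like at the sockets API level — illustrative code, not the actual BTL/OOB implementation; it assumes a Linux/BSD-style stack where IPV6_V6ONLY can be cleared (the setting OpenBSD disables by default):]

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* One AF_INET6 listening socket that also accepts IPv4 clients;
 * v4 peers then show up as ::ffff:a.b.c.d mapped addresses. */
int open_dual_stack_listener(uint16_t *port_out) {
    int fd = socket(AF_INET6, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    /* 0 = accept IPv4 via mapped addresses.  OpenBSD forces this on
     * unless the admin changes the kernel setting at runtime. */
    int v6only = 0;
    setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &v6only, sizeof(v6only));

    struct sockaddr_in6 addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin6_family = AF_INET6;
    addr.sin6_addr   = in6addr_any;  /* :: covers v6 and (mapped) v4 */
    addr.sin6_port   = 0;            /* let the kernel pick a port   */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    socklen_t len = sizeof(addr);
    getsockname(fd, (struct sockaddr *)&addr, &len);
    *port_out = ntohs(addr.sin6_port);
    return fd;
}
```

This is the "one FD instead of two" trade being discussed: a single listener replaces the separate IPv4 and IPv6 listen sockets.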


-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de


Re: [OMPI devel] IPv4 mapped IPv6 addresses

2007-12-14 Thread Brian W. Barrett

On Fri, 14 Dec 2007, Adrian Knoth wrote:


Should we consider moving towards these mapped addresses? The
implications:

  - less code, only one socket to handle
  - better FD consumption
  - breaks WinXP support, but not Vista/Longhorn or later
  - requires non-default kernel runtime setting on OpenBSD for IPv4
connections

FWIW, FD consumption is the only real issue to consider.


My thought is no.  The resource consumption isn't really an issue to 
consider.  It would also simplify the code (although work that Adrian and 
I did later to clean up the TCP OOB component has limited that).  If you 
look at the FD count issue, you're going to reduce the number of FDs (for 
the OOB anyway) by 2.  Not (2 * NumNodes), but 2 (one for BTL, one for 
OOB).  Today we have a listen socket for IPv4 and another for IPv6.  With 
IPv4 mapped addresses, we'd have one that did both.  In terms of per-peer 
connections, the OOB tries one connection at a time, so there will be at 
most 1 OOB connection between any two peers.


In return for 2 FDs, we'd have to play with code that we know works and 
that, with cleanups over the last year, has actually become quite simple.  We'd 
have to break WinXP support (when it sounds like no one is really moving 
to Vista), and we'd break out-of-the-box OpenBSD.


Brian


[OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r16959

2007-12-14 Thread Jeff Squyres
This commit does what we previously discussed: it only compiles the  
XOOB openib CPC if XRC support is actually present (vs. having a stub  
XOOB when XRC is not present).  This is on the /tmp-public/openib-cpc  
branch.


I have some Hermon HCAs, but due to dumb issues, I don't have an XRC- 
capable OFED on those nodes yet.  It'll probably take me a few more  
days before I have that ready.


Could someone try the openib-cpc tmp branch and ensure I didn't break  
the case where XRC support is available?  It is easy to tell if the  
XOOB CPC compiled in -- run this command:


ompi_info --param btl openib --parsable | grep xoob

If the output is empty, then XOOB was not compiled in.  If you see  
output, then XOOB was compiled in.


Thanks!



Begin forwarded message:


From: jsquy...@osl.iu.edu
Date: December 14, 2007 12:10:24 PM EST
To: svn-f...@open-mpi.org
Subject: [OMPI svn-full] svn:open-mpi r16959
Reply-To: de...@open-mpi.org

Author: jsquyres
Date: 2007-12-14 12:10:23 EST (Fri, 14 Dec 2007)
New Revision: 16959
URL: https://svn.open-mpi.org/trac/ompi/changeset/16959

Log:
Only compile in the XOOB CPC if a) configure found that we have XRC
support available and b) the user didn't disable connectx support.

Text files modified:
  tmp-public/openib-cpc/config/ompi_check_openib.m4                           |  3 ++-
  tmp-public/openib-cpc/ompi/mca/btl/openib/Makefile.am                       |  8 ++--
  tmp-public/openib-cpc/ompi/mca/btl/openib/configure.m4                      |  8
  tmp-public/openib-cpc/ompi/mca/btl/openib/connect/btl_openib_connect_base.c |  2 ++
  tmp-public/openib-cpc/ompi/mca/btl/openib/connect/btl_openib_connect_xoob.c | 23 ---

  5 files changed, 18 insertions(+), 26 deletions(-)

Modified: tmp-public/openib-cpc/config/ompi_check_openib.m4
==============================================================================

--- tmp-public/openib-cpc/config/ompi_check_openib.m4   (original)
+++ tmp-public/openib-cpc/config/ompi_check_openib.m4	2007-12-14 12:10:23 EST (Fri, 14 Dec 2007)

@@ -102,7 +102,8 @@
AS_IF([test "$ompi_check_openib_happy" = "yes"],
  [AC_CHECK_DECLS([IBV_EVENT_CLIENT_REREGISTER], [], [],
  [#include ])
-   AC_CHECK_FUNCS([ibv_get_device_list ibv_resize_cq ibv_open_xrc_domain])])

+   AC_CHECK_FUNCS([ibv_get_device_list ibv_resize_cq])
+   AC_CHECK_FUNCS([ibv_open_xrc_domain], [$1_have_xrc=1])])

CPPFLAGS="$ompi_check_openib_$1_save_CPPFLAGS"
LDFLAGS="$ompi_check_openib_$1_save_LDFLAGS"

Modified: tmp-public/openib-cpc/ompi/mca/btl/openib/Makefile.am
==============================================================================

--- tmp-public/openib-cpc/ompi/mca/btl/openib/Makefile.am   (original)
+++ tmp-public/openib-cpc/ompi/mca/btl/openib/Makefile.am	2007-12-14 12:10:23 EST (Fri, 14 Dec 2007)

@@ -55,14 +55,18 @@
connect/btl_openib_connect_base.c \
connect/btl_openib_connect_oob.c \
connect/btl_openib_connect_oob.h \
-connect/btl_openib_connect_xoob.c \
-connect/btl_openib_connect_xoob.h \
connect/btl_openib_connect_rdma_cm.c \
connect/btl_openib_connect_rdma_cm.h \
connect/btl_openib_connect_ibcm.c \
connect/btl_openib_connect_ibcm.h \
connect/connect.h

+if MCA_btl_openib_have_xrc
+sources += \
+connect/btl_openib_connect_xoob.c \
+connect/btl_openib_connect_xoob.h
+endif
+
# Make the output library in this directory, and name it either
# mca__.la (for DSO builds) or libmca__.la
# (for static builds).

Modified: tmp-public/openib-cpc/ompi/mca/btl/openib/configure.m4
==============================================================================

--- tmp-public/openib-cpc/ompi/mca/btl/openib/configure.m4  (original)
+++ tmp-public/openib-cpc/ompi/mca/btl/openib/configure.m4	2007-12-14 12:10:23 EST (Fri, 14 Dec 2007)

@@ -18,6 +18,14 @@
# $HEADER$
#

+# MCA_btl_openib_POST_CONFIG([should_build])
+# --
+AC_DEFUN([MCA_btl_openib_POST_CONFIG], [
+AS_IF([test $1 -eq 0 -a "$enable_dist" = "yes"],
+      [AC_MSG_ERROR([BTL openib is disabled but --enable-dist specifed.  This will result in a bad tarball.  Aborting configure.])])
+AM_CONDITIONAL([MCA_btl_openib_have_xrc], [test $1 -eq 1 -a "x$btl_openib_have_xrc" = "x1" -a "x$ompi_want_connectx_xrc" = "x1"])

+])
+

# MCA_btl_openib_CONFIG([action-if-can-compile],
#  [action-if-cant-compile])

Modified: tmp-public/openib-cpc/ompi/mca/btl/openib/connect/btl_openib_connect_base.c
==============================================================================
--- tmp-public/openib-cpc/ompi/mca/btl/openib/connect/btl_openib_connect_base.c	(original)
+++ tmp-public/openib-cpc/ompi/mca/btl/openib/connect/btl_openib_connect_base.c	2007-12-14 12:10:23 EST (Fri, 14 Dec 2007)

@@ -34,7 +34,9 @@
 */

[OMPI devel] Gnu #ident

2007-12-14 Thread Jeff Squyres
We recently added a feature on the trunk that puts ident strings in  
the 3 OMPI libraries.


Short version:
--

We use "#ident foo" for GNU, but it emits a stderr warning that it's a  
GCC extension when used with -pedantic (we check that #ident works in  
configure).  Does anyone care?  Should we add an option to disable  
#ident for GNU compilers?


Longer version:
---

configure checks for 2 things:

1. Does "#pragma ident " work
2. Does "#ident " work

If either of these works (in order), it is used.  If neither works, a  
static const char[] is used.  This feature was added as part of the  
"branding" strategy, and also to help Sun with debugging, because they  
have a handy command (ident(1)?  I don't remember offhand) that will  
look for the #pragma ident in a library and print it out.  Good for  
support issues.


However, the GNU compilers support #ident but apparently print a  
stderr warning about it when -pedantic is used (which we automatically  
enable for debugging builds via --enable-picky).
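[A minimal illustration of the fallback chain being described — the version string and names here are made up, and this is not the actual OMPI configure logic. With GCC, the #ident string lands in the object file's .comment section (and warns under -pedantic); the static array is the portable fallback:]

```c
#include <string.h>

/* On GNU compilers, embed the string via #ident (emits a -pedantic
 * warning: "#ident is a GCC extension").  Keep a plain static array
 * as the portable fallback that survives in the binary either way. */
#if defined(__GNUC__)
#ident "Example library 1.0 ident string"
#endif
static const char ident_fallback[] = "Example library 1.0 ident string";

/* Accessor so the string is referenced and not discarded. */
const char *library_ident(void) { return ident_fallback; }
```

Tools like ident(1) or strings(1) can then find the embedded string in the built library, which is the support-diagnosis use case mentioned above.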


Does anyone care?  Should we put in an option to turn off #ident for  
the GNU compilers (perhaps only when -pedantic is used)?


I ask because a few people have noticed the new output on stderr and  
asked me about it, so I figured I'd raise the "does anyone care?"  
flag...


--
Jeff Squyres
Cisco Systems


Re: [OMPI devel] matching code rewrite in OB1

2007-12-14 Thread Gleb Natapov
On Fri, Dec 14, 2007 at 06:53:55AM -0500, Richard Graham wrote:
> If you have positive confirmation that such things have happened, this will
> go a long way.
I instrumented the code to log all kinds of info about fragment reordering while
I chased a bug in openib that caused the matching logic to malfunction. Any
non-trivial application that uses the openib BTL will have reordered fragments.
(I wish this were not the case, but I don't have a solution yet.)

> I will not trust the code until this has also been done with
> multiple independent network paths. 
I ran IMB over IP and IB simultaneously on more than 80 ranks.

>  I very rarely express such strong
> opinions, even if I don't agree with what is being done, but this is the
> core of correct MPI functionality, and first hand experience has shown that
I agree that this is indeed a very important piece of code, but it certainly
is not more important than the datatype engine, for instance (and it is much
easier to test all corner cases in the matching logic than in the datatype
engine, IMHO). And even if the matching code works perfectly but other parts
of OB1 are buggy, Open MPI will not work properly, so why is this code
treated as a sacred cow?

> just thinking through the logic, I can miss some of the race conditions.
That is of course correct, but the more people look at the code the
better, isn't it?

> The code here has been running for 8+ years in two production MPI's running
> on very large clusters, so I am very reluctant to make changes for what
Are you sure about this? I see a number of changes to this code during
Open MPI development, and the current SVN does not hold all the history of
this code, unfortunately. Here is the list of commits that I found; some
of them change the code logic quite a bit:
r6770,r7342,r8339,r8352,r8353,r8356,r8946,r11874,r12323,r12582

> seems to amount to people's taste - maintenance is not an issue in this
> case.  Had this not been such a key bit of code, I would not even bat an
Why do you think that maintenance is not an issue? It is for me; otherwise
I wouldn't even look at this part of the code. All those macros prevent the
use of a debugger, for instance.

(And I see a small latency improvement too :))

> eye.  I suppose if you can go through some formal verification, this would
> also be good - actually better than hoping that one will hit out-of-order
> situations.
> 
> Rich
> 
> 
> On 12/14/07 2:20 AM, "Gleb Natapov"  wrote:
> 
> > On Thu, Dec 13, 2007 at 06:16:49PM -0500, Richard Graham wrote:
> >> The situation that needs to be triggered, just as George has mentioned, is
> >> where we have a lot of unexpected messages, to make sure that when one that
> >> we can match against comes in, all the unexpected messages that can be
> >> matched with pre-posted receives are matched.  Since we attempt to match
> >> only when a new fragment comes in, we need to make sure that we don't leave
> >> other unexpected messages that can be matched in the unexpected queue, as
> >> these (if the out of order scenario is just right) would block any new
> >> matches from occurring.
> >> 
> >> For example:  Say the next expected message is 25
> >> 
> >> Unexpected message queue has:  26 28 29 ..
> >> 
> >> If 25 comes in and is handled, but 26 is not pulled off the unexpected
> >> message queue, then when 27 comes in it won't be able to be matched, as 26 is
> >> sitting in the unexpected queue and will never be looked at again ...
> > This situation is triggered constantly with the openib BTL. The openib BTL has
> > two ways to receive a packet: over a send queue or over an eager RDMA path.
> > The receiver polls both of them and may reorder packets locally. There is
> > currently a bug in the openib BTL where one channel may starve the other at
> > the receiver, so if a match fragment with the next sequence number is in the
> > starved path, tens of thousands of fragments can be reordered. The test case
> > attached to ticket #1158 triggers this case, and my patch handles all
> > reordered packets.
> > 
> > And, by the way, the code is much simpler now and can be reviewed easily ;)
> > 
> > --
> > Gleb.
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] New BTL parameter

2007-12-14 Thread Gleb Natapov
If there is no objection I will commit this to the trunk next week.

On Sun, Dec 09, 2007 at 05:34:30PM +0200, Gleb Natapov wrote:
> Hi,
> 
>   Currently the BTL has a parameter, btl_min_send_size, that is no longer
> used. I want to change it to btl_rndv_eager_limit. This new parameter will
> determine the size of the first fragment of the rendezvous protocol. Now we
> use btl_eager_limit to set its size. btl_rndv_eager_limit will have to be
> smaller than or equal to btl_eager_limit. By default it will be equal to
> btl_eager_limit, so no behavior change will be observed if the default is
> used.
> 
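[The proposed defaulting/clamping rule can be sketched as a small helper — hypothetical code, not the actual trunk change; here 0 stands for "unset":]

```c
#include <stddef.h>

/* Resolve the effective rendezvous first-fragment size:
 * unset (0) defaults to the eager limit, and the value is never
 * allowed to exceed the eager limit. */
size_t resolve_rndv_eager_limit(size_t eager_limit, size_t rndv_eager_limit) {
    if (rndv_eager_limit == 0 || rndv_eager_limit > eager_limit)
        return eager_limit;   /* default / clamp to btl_eager_limit */
    return rndv_eager_limit;
}
```

So with the default in place, the first rendezvous fragment stays the same size as today, preserving current behavior unless the user explicitly lowers it.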

--
Gleb.


Re: [OMPI devel] initial SCTP BTL commit comments?

2007-12-14 Thread Brad Penoff
On Nov 14, 2007 10:17 AM, Brad Penoff  wrote:
>
> On Nov 14, 2007 5:11 AM, Terry Dontje  wrote:
> >
> > Brad Penoff wrote:
> > > On Nov 12, 2007 3:26 AM, Jeff Squyres  wrote:
> > >
> > >> I have no objections to bringing this into the trunk, but I agree that
> > >> an .ompi_ignore is probably a good idea at first.
> > >>
> > >
> > > I'll try to cook up a commit soon then!
> > >
> > >
> > >> One question that I'd like to have answered is how OMPI decides
> > >> whether to use the SCTP BTL or not.  If there are SCTP stacks
> > >> available by default in Linux and OS X -- but their performance may be
> > >> sub-optimal and/or buggy, we may want to have the SCTP BTL only
> > >> activated if the user explicitly asks for it.  Open MPI is very
> > >> concerned with "out of the box" behavior -- we need to ensure that
> > >> "mpirun a.out" will "just work" on all of our supported platforms.
> > >>
> > >
> > > Just to make a few things explicit...
> > >
> > > Things would only work out of the box on FreeBSD, and there the stack
> > > is very good.
> > >
> > > We have less experience with the Linux stack but hope the availability
> > > of an SCTP BTL will help encourage its use by us and others.  Now it
> > > is a module by default (loaded with "modprobe sctp") but the actual
> > > SCTP sockets extension API needs to be downloaded and installed
> > > separately.  The so-called lksctp-tools can be obtained here:
> > > http://sourceforge.net/project/showfiles.php?group_id=26529
> > >
> > > The OS X stack does not come by default but instead is a kernel extension:
> > > http://sctp.fh-muenster.de/sctp-nke.html
> > > I haven't yet started this testing but intend to soon.  As of now
> > > though, the supplied configure.m4 does not try to even build the
> > > component on Mac OS X.
> > >
> > > So in my opinion, things in the configure scripts should be fine the
> > > way they are, since only the FreeBSD stack (which we have confidence in)
> > > will try to work out of the box; the others require the user to
> > > install things.
> > >
>
> Greetings,
>
> > I am gathering from the text above you haven't tried your BTL on Solaris
> > at all.
>
> The short answer to that is correct, we haven't tried the Open MPI
> SCTP BTL yet on Solaris.  In fact, the configure.m4 file checks the
> $host value and only tries to build if it's on Linux or a BSD variant.
>  Mac OS X uses the same code as BSD but I have only just got my hands
> on a machine so even it hasn't been tested yet; Solaris remains on the
> TODO list.
>
> However, there's a slightly longer answer...
>
> After a series of emails with the Sun SCTP people
> (sctp-questi...@sun.com but mostly Kacheong Poon) a year ago, I
> learned SCTP support is within Solaris 10 by default.  In general,
> SCTP supports its own socket API, in addition to the standard Berkeley
> sockets API; the SCTP-specific sockets API unlocks some of SCTP's
> > > newer features (e.g., multistreaming).  We make use of this
> SCTP-specific sockets API.
>
> The Solaris stack (as of a year ago) made certain assumptions about
> the SCTP-specific sockets API.  I'm just looking back on those emails
> now to refresh my memory... it looks like on the Solaris stack as of
> Nov 2006, it did not allow the use of one-to-many sockets (the current
> default in our BTL) together with the sctp_sendmsg call.  They
> mentioned an alternative, but we didn't have the time to explore it.
> I'm not sure if this has changed on the Solaris stack within the past
> year... I never got the time to revisit this.
>
> In the past, we had mostly used the one-to-many socket (with our LAM
> and MPICH2 versions).  One unique thing about this Open MPI SCTP BTL
> is that there is also a choice to make use of (the more TCP-like)
> one-to-one socket style.  The socket style used by the SCTP BTL is
> adjustable with the MCA parameter btl_sctp_if_11 (if set to 1, it uses
> 1-1 sockets; by default it is 0 and uses 1-many).  I've never used
> one-to-one sockets on the Solaris stack, but it may have a better
> chance of working (also one-to-many may work now; I haven't kept
> up-to-date).
>
> We also noticed that on Solaris we had to do some things a little
> different with iovec's because the struct msghdr (used by sendmsg) had
> no msg_control field; to get around this, we had to pack the iovec's
> contents into a buffer and send that buffer instead of using the iovec
> directly.
>
> Anyway, hope this fully answers your questions.  In general, it'd be
> nice if we have the time/assistance to add in Solaris support
> eventually.
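[The iovec workaround described above — flattening the vector into one contiguous buffer when struct msghdr lacks the expected fields — can be sketched like this. Illustrative only; the function name and error convention are made up, not the BTL's actual code:]

```c
#include <string.h>
#include <sys/uio.h>

/* Copy the contents of an iovec array into one contiguous buffer so a
 * single-buffer send call (e.g. sctp_sendmsg) can be used instead of
 * sendmsg with the iovec directly.  Returns total bytes packed, or 0
 * if the buffer is too small. */
size_t pack_iovec(const struct iovec *iov, int iovcnt,
                  char *buf, size_t buflen) {
    size_t off = 0;
    for (int i = 0; i < iovcnt; i++) {
        if (off + iov[i].iov_len > buflen)
            return 0;                         /* would overflow buf */
        memcpy(buf + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }
    return off;
}
```

The cost is an extra copy per send, which is why using the iovec directly is preferred where the platform's msghdr allows it.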

With r16967, the SCTP BTL in ompi-trunk now tries to build on Solaris
if it finds the required libraries and structs.  I've done some sanity
tests on our Solaris box here and I haven't seen any problems.  Please
let me know if you see otherwise, as these changes can be easily undone
(by commenting out the Solaris case statement and the contents of the SCTP
BTL configure.m4).

Following Jeff's suggestions, the default code path for Solaris