Re: [OMPI devel] openmpi-1.3rc4 build failure with qsnet4.30

2009-01-14 Thread Paul H. Hargrove

I can confirm that both 1.3rc6 and 1.2.9rc2 now build fine for me.

-Paul

George Bosilca wrote:

Paul,

Thanks for noticing the Elan problem. It appears we missed one patch in 
the 1.3 branch (https://svn.open-mpi.org/trac/ompi/changeset/20122). I'll 
file a CMR ASAP.


  Thanks,
george.

On Jan 13, 2009, at 16:31 , Paul H. Hargrove wrote:

Since it looks like you guys are very close to release, I just 
grabbed the 1.3rc4 tarball to give it a spin.

Unfortunately, the elan BTL is not building:

$ ../configure --prefix=<...> CC=<...> CXX=<path to g++-4.3.2> FC=<...>

...
$ make
...
Making all in mca/btl/elan
make[2]: Entering directory 
`/home/pcp1/phargrov/OpenMPI/openmpi-1.3rc4/BLD/ompi/mca/btl/elan'

depbase=`echo btl_elan.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;\
  /bin/sh ../../../../libtool --tag=CC   --mode=compile 
/usr/local/pkg/gcc-4.3.2/bin/gcc -DHAVE_CONFIG_H -I. 
-I../../../../../ompi/mca/btl/elan -I../../../../opal/include 
-I../../../../orte/include -I../../../../ompi/include 
-I../../../../opal/mca/paffinity/linux/plpa/src/libplpa   
-I../../../../.. -I../../../.. -I../../../../../opal/include 
-I../../../../../orte/include -I../../../../../ompi/include -O3 
-DNDEBUG -finline-functions -fno-strict-aliasing -pthread 
-fvisibility=hidden -MT btl_elan.lo -MD -MP -MF $depbase.Tpo -c -o 
btl_elan.lo ../../../../../ompi/mca/btl/elan/btl_elan.c &&\

  mv -f $depbase.Tpo $depbase.Plo
libtool: compile:  /usr/local/pkg/gcc-4.3.2/bin/gcc -DHAVE_CONFIG_H 
-I. -I../../../../../ompi/mca/btl/elan -I../../../../opal/include 
-I../../../../orte/include -I../../../../ompi/include 
-I../../../../opal/mca/paffinity/linux/plpa/src/libplpa 
-I../../../../.. -I../../../.. -I../../../../../opal/include 
-I../../../../../orte/include -I../../../../../ompi/include -O3 
-DNDEBUG -finline-functions -fno-strict-aliasing -pthread 
-fvisibility=hidden -MT btl_elan.lo -MD -MP -MF .deps/btl_elan.Tpo -c 
../../../../../ompi/mca/btl/elan/btl_elan.c  -fPIC -DPIC -o 
.libs/btl_elan.o

In file included from /usr/include/qsnet/fence.h:116,
   from /usr/include/elan3/elan3.h:42,
   from ../../../../../ompi/mca/btl/elan/btl_elan.h:34,
   from ../../../../../ompi/mca/btl/elan/btl_elan.c:18:
/usr/include/asm/bitops.h:333:2: warning: #warning This includefile 
is not available on all architectures.
/usr/include/asm/bitops.h:334:2: warning: #warning Using kernel 
headers in userspace.
../../../../../ompi/mca/btl/elan/btl_elan.c: In function 
'mca_btl_elan_add_procs':
../../../../../ompi/mca/btl/elan/btl_elan.c:167: error: 
'ELAN_TPORT_USERCOPY_DISABLE' undeclared (first use in this function)
../../../../../ompi/mca/btl/elan/btl_elan.c:167: error: (Each 
undeclared identifier is reported only once
../../../../../ompi/mca/btl/elan/btl_elan.c:167: error: for each 
function it appears in.)
../../../../../ompi/mca/btl/elan/btl_elan.c: In function 
'mca_btl_elan_get':
../../../../../ompi/mca/btl/elan/btl_elan.c:551: warning: cast to 
pointer from integer of different size

make[2]: *** [btl_elan.lo] Error 1
make[2]: Leaving directory 
`/home/pcp1/phargrov/OpenMPI/openmpi-1.3rc4/BLD/ompi/mca/btl/elan'

make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory 
`/home/pcp1/phargrov/OpenMPI/openmpi-1.3rc4/BLD/ompi'

make: *** [all-recursive] Error 1

$ rpm -qif /usr/include/qsnet
Name        : qsnet-headers                    Relocations: (not relocateable)
Version     : 4.30qsnet                        Vendor: (none)
Release     : 0                                Build Date: Mon 31 Jan 2005 07:36:45 AM PST
Install date: Mon 13 Mar 2006 04:37:36 PM PST   Build Host: pingu
Group       : Development/System               Source RPM: qsnet-headers-4.30qsnet-0.src.rpm
Size        : 608924                           License: GPL
Signature   : (none)
Summary     : The QsNet header files for the qsnet Linux kernel.
Description :
The headers package contains the QsNet kernel headers which are
required by library programmers to use the QsNet hardware.


I couldn't find any info in the README about the minimum supported 
version of qsnet.  However, I did notice a cut-and-paste error in the 
following text in the README ("InfiniPath" should be "Elan"):


--with-elan=
Specify the directory where the Quadrics Elan library and header
files are located.  This option is generally only necessary if the
InfiniPath headers and libraries are not in default compiler/linker
search paths.


Sorry not to have done any testing earlier than today.

-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org

[OMPI devel] Open MPI v1.3rc6 has been posted

2009-01-14 Thread Tim Mattox
Hi All,
The sixth (yes 6!) release candidate of Open MPI v1.3 is now available:

 http://www.open-mpi.org/software/ompi/v1.3/

Please run it through its paces as best you can.
Anticipated release of 1.3 is tomorrow morning.

This only has a fix for a segfault in coll_hierarch_component.c
with respect to rc5 (ticket #1751), so if you have already
started testing with rc5, and are not explicitly enabling
coll_hierarch, there is no need to start your tests over.
-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/


Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread Brad Benton
r20275 looks good.  I suggest that we CMR that into 1.3 and get rc6 rolled
and tested. (actually, Jeff just did the CMR...so off to rc6)
--brad


On Wed, Jan 14, 2009 at 1:16 PM, Edgar Gabriel  wrote:

> so I am not entirely sure why the bug only happened on trunk, it could in
> theory also appear on v1.3 (is there a difference in how pointer_arrays are
> handled between the two versions?)
>
> Anyway, it passes now on both with changeset 20275. We should probably move
> that over to 1.3 as well, whether for 1.3.0 or 1.3.1 I leave that up to
> others to decide...
>
> Thanks
> Edgar
>
>
> Edgar Gabriel wrote:
>
>> I'm already debugging it. The good news is that it only seems to appear
>> with the trunk; with 1.3 (after copying the new tuned module over), all the
>> tests pass.
>>
>> Now if somebody can tell me a trick on how to tell mpirun not to kill the
>> debugger out from under me, then I could even see where the problem occurs :-)
>>
>> Thanks
>> Edgar
>>
>> George Bosilca wrote:
>>
>>> All these errors are in the MPI_Finalize, it should not be that hard to
>>> find. I'll take a look later this afternoon.
>>>
>>>  george.
>>>
>>> On Jan 14, 2009, at 06:41 , Tim Mattox wrote:
>>>
>>>  Unfortunately, although this fixed some problems when enabling hierarch
 coll,
 there is still a segfault in two of IU's tests that only shows up when
 we set
 -mca coll_hierarch_priority 100

 See this MTT summary to see how the failures improved on the trunk,
 but that there are still two that segfault even at 1.4a1r20267:
 http://www.open-mpi.org/mtt/index.php?do_redir=923

 This link just has the remaining failures:
 http://www.open-mpi.org/mtt/index.php?do_redir=922

 So, I'll vote for applying the CMR for 1.3 since it clearly improved
 things,
 but there is still more to be done to get coll_hierarch ready for
 regular
 use.

 On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca 
 wrote:

> Here we go by the book :)
>
> https://svn.open-mpi.org/trac/ompi/ticket/1749
>
> george.
>
> On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:
>
>  Let's debate tomorrow when people are around, but first you have to
>> file a
>> CMR... :-)
>>
>> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>>
>>  Unfortunately, this pinpoints the fact that we didn't test enough the
>>> collective module mixing thing. I went over the tuned collective
>>> functions
>>> and changed all instances to use the correct module information. It
>>> is now
>>> on the trunk, revision 20267. Simultaneously, I checked that all other
>>> collective components do the right thing ... and I have to admit
>>> tuned was
>>> the only faulty one.
>>>
>>> This is clearly a bug in the tuned, and correcting it will allow
>>> people
>>> to use the hierarch. In the current incarnation 1.3 will
>>> mostly/always
>>> segfault when hierarch is active. I would prefer not to give a broken
>>> toy
>>> out there. How about pushing r20267 in the 1.3?
>>>
>>> george.
>>>
>>>
>>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>>
>>>  Thanks for digging into this.  Can you file a bug?  Let's mark it
 for
 v1.3.1.

 I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,
 and
 since hierarch isn't currently selected by default (you must
 specifically
 elevate hierarch's priority to get it to run), there's no danger
 that users
 will run into this problem in default runs.

 But clearly the problem needs to be fixed, and therefore we need a
 bug
 to track it.



 On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

  I just debugged the Reduce_scatter bug mentioned previously. The
> bug is
> unfortunately not in hierarch, but in tuned.
>
> Here is the code snippet causing the problems:
>
> int reduce_scatter (..., mca_coll_base_module_t *module)
> {
> ...
> err = comm->c_coll.coll_reduce (..., module)
> ...
> }
>
>
> but should be
> {
> ...
> err = comm->c_coll.coll_reduce (...,
> comm->c_coll.coll_reduce_module);
> ...
> }
>
> The problem as it is right now is that when using hierarch, only a
> subset of the functions is set, e.g. reduce, allreduce, bcast, and
> barrier.
> Thus, reduce_scatter is from tuned in most scenarios, and calls the
> subsequent functions with the wrong module. Hierarch of course does
> not like
> that :-)
>
> Anyway, a quick glance through the tuned code reveals a significant
> number of instances where this appears (reduce_scatter, allreduce,
> allgather, allgatherv). Basic, hierarch and inter seem to do that
> mostly correctly.
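
As a minimal stand-alone illustration of the convention described above
(simplified types of my own -- not the actual OMPI structures; only the
c_coll / per-function module naming follows the snippet):

#include <stdio.h>

typedef struct module { const char *name; } module_t;

typedef struct c_coll {
    int (*coll_reduce)(int value, module_t *module);
    module_t *coll_reduce_module;          /* module owning coll_reduce */
    int (*coll_reduce_scatter)(int value, struct c_coll *coll,
                               module_t *module);
    module_t *coll_reduce_scatter_module;  /* module owning reduce_scatter */
} c_coll_t;

static int hierarch_reduce(int value, module_t *module)
{
    /* A reduce implementation expects to be handed *its own* module. */
    printf("reduce running with module '%s'\n", module->name);
    return value;
}

static int tuned_reduce_scatter(int value, c_coll_t *coll, module_t *module)
{
    (void)module;  /* our own module is not what coll_reduce wants */
    /* The fix: forward the reduce function's own module pointer. */
    return coll->coll_reduce(value, coll->coll_reduce_module);
}

int main(void)
{
    module_t hierarch_mod = { "hierarch" };
    module_t tuned_mod    = { "tuned" };
    c_coll_t coll = { hierarch_reduce, &hierarch_mod,
                      tuned_reduce_scatter, &tuned_mod };

    /* Mixed selection, as in the report: reduce comes from one
     * component, reduce_scatter from another. */
    return coll.coll_reduce_scatter(42, &coll,
                                    coll.coll_reduce_scatter_module) == 42 ? 0 : 1;
}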

Re: [OMPI devel] OpenMPI rpm build 1.3rc3r20226 build failed

2009-01-14 Thread Matthias Jurenz
Sorry, I have searched the whole day for a solution to that problem, but
unfortunately I'm clueless :-( I cannot say which flag causes the
compile error. Furthermore, I'm unable to reproduce this error on
several other platforms.
The coding style in the affected source file doesn't look special either...

My suggestion is to use the workaround (configure flag) for the 1.3
release. 

On Wed, 2009-01-14 at 07:57 -0500, Jeff Squyres wrote:
> Is there some code that can be fixed instead?  I.e., is this feature  
> totally incompatible with whatever RPM compiler flags are used, or is  
> it just some coding style that these particular flags don't like?
> 
> 
> On Jan 14, 2009, at 5:05 AM, Matthias Jurenz wrote:
> 
> > Another workaround should be to disable the I/O tracing feature of VT
> > by adding the configure option
> >
> > '--with-contrib-vt-flags=--disable-iotrace'
> >
> > That will have the effect that the upcoming OMPI RPMs have no support
> > for I/O tracing, but in our opinion it is not so bad...
> >
> > Furthermore, we could add the configure option in
> > 'ompi/contrib/vt/configure.m4' to retain the feature-consistency  
> > between
> > the rpm's and the source packages.
> >
> >
> > Matthias
> >
> > On Tue, 2009-01-13 at 17:13 +0200, Lenny Verkhovsky wrote:
> >> I don't want to move changes (the default value of the flag), since
> >> there
> >> are important people for whom it works :)
> >> I also think that this is a VT issue, but I guess we are the only ones
> >> who experience the errors.
> >>
> >> We can now override these params from the environment as a
> >> workaround;
> >> Mike committed a buildrpm.sh script to the trunk (r20253) that allows
> >> overriding params from the environment.
> >>
> >> We observed the problem on CentOS 5.2 with the bundled gcc and RedHat
> >> 5.2
> >> with the bundled gcc.
> >>
> >> #uname -a
> >> Linux elfit1 2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64
> >> x86_64 x86_64 GNU/Linux
> >>
> >> #lsb_release -a
> >> LSB Version:
> >> :core-3.1-amd64:core-3.1-ia32:core-3.1-noarch:graphics-3.1- 
> >> amd64:graphics-3.1-ia32:graphics-3.1-noarch
> >> Distributor ID: CentOS
> >> Description:CentOS release 5.2 (Final)
> >> Release:5.2
> >> Codename:   Final
> >>
> >> gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)
> >>
> >> Best regards,
> >> Lenny.
> >>
> >>
> >> On Tue, Jan 13, 2009 at 4:40 PM, Jeff Squyres   
> >> wrote:
> >>> I'm still guessing that this is a distro / compiler issue -- I can  
> >>> build
> >>> with the default flags just fine...?
> >>>
> >>> Can you specify what distro / compiler you were using?
> >>>
> >>> Also, if you want to move the changes that have been made to  
> >>> buildrpm.sh to
> >>> the v1.3 branch, just file a CMR.  That file is not included in  
> >>> release
> >>> tarballs, so Tim can move it over at any time.
> >>>
> >>>
> >>>
> >>> On Jan 13, 2009, at 6:35 AM, Lenny Verkhovsky wrote:
> >>>
>  it seems that setting use_default_rpm_opt_flags to 0 solves the  
>  problem.
>  Maybe VT developers should take a look at it.
> 
>  Lenny.
> 
> 
>  On Sun, Jan 11, 2009 at 2:40 PM, Jeff Squyres  
>   wrote:
> >
> > This sounds like a distro/compiler version issue.
> >
> > Can you narrow down the issue at all?
> >
> >
> > On Jan 11, 2009, at 3:23 AM, Lenny Verkhovsky wrote:
> >
> >> It doesn't happen if I do autogen, configure and make install,
> >> only when I try to make an rpm from the tar file.
> >>
> >>
> >>
> >> On Thu, Jan 8, 2009 at 9:43 PM, Jeff Squyres  
> >>  wrote:
> >>>
> >>> This doesn't happen in a normal build of the same tree?
> >>>
> >>> I ask because both 1.3r20226 builds fine for me manually (i.e.,
> >>> ./configure;make and buildrpm.sh).
> >>>
> >>>
> >>> On Jan 8, 2009, at 8:15 AM, Lenny Verkhovsky wrote:
> >>>
>  Hi,
> 
 I am trying to build an rpm from nightly snapshots of 1.3
> 
>  with the downloaded buildrpm.sh and ompi.spec file from
>  http://svn.open-mpi.org/svn/ompi/branches/v1.3/contrib/dist/linux/
> 
>  I am getting this error
>  .
>  Making all in vtlib
>  make[5]: Entering directory
> 
>  `/hpc/home/USERS/lennyb/work/svn/release/scripts/dist-1.3--1/ 
>  OMPI/BUILD/
>  openmpi-1.3rc3r20226/ompi/contrib/vt/vt/vtlib'
>  gcc  -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib
>  -I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE
>  -DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\"
>  -DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG
>  -DVT_MEMHOOK -DVT_IOWRAP  -O2 -g -pipe -Wall -Wp,- 
>  D_FORTIFY_SOURCE=2
>  -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
>  

Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread Edgar Gabriel
So I am not entirely sure why the bug only happened on the trunk; it could 
in theory also appear on v1.3 (is there a difference in how 
pointer_arrays are handled between the two versions?)


Anyway, it passes now on both with changeset 20275. We should probably 
move that over to 1.3 as well, whether for 1.3.0 or 1.3.1 I leave that 
up to others to decide...


Thanks
Edgar

Edgar Gabriel wrote:
I'm already debugging it. The good news is that it only seems to appear 
with the trunk; with 1.3 (after copying the new tuned module over), all the 
tests pass.


Now if somebody can tell me a trick on how to tell mpirun not to kill the 
debugger out from under me, then I could even see where the problem occurs :-)


Thanks
Edgar

George Bosilca wrote:
All these errors are in the MPI_Finalize, it should not be that hard 
to find. I'll take a look later this afternoon.


  george.

On Jan 14, 2009, at 06:41 , Tim Mattox wrote:

Unfortunately, although this fixed some problems when enabling 
hierarch coll,
there is still a segfault in two of IU's tests that only shows up 
when we set

-mca coll_hierarch_priority 100

See this MTT summary to see how the failures improved on the trunk,
but that there are still two that segfault even at 1.4a1r20267:
http://www.open-mpi.org/mtt/index.php?do_redir=923

This link just has the remaining failures:
http://www.open-mpi.org/mtt/index.php?do_redir=922

So, I'll vote for applying the CMR for 1.3 since it clearly improved 
things,
but there is still more to be done to get coll_hierarch ready for 
regular

use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca 
 wrote:

Here we go by the book :)

https://svn.open-mpi.org/trac/ompi/ticket/1749

george.

On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:

Let's debate tomorrow when people are around, but first you have to 
file a

CMR... :-)

On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:


Unfortunately, this pinpoints the fact that we didn't test enough the
collective module mixing thing. I went over the tuned collective 
functions
and changed all instances to use the correct module information. 
It is now

on the trunk, revision 20267. Simultaneously, I checked that all other
collective components do the right thing ... and I have to admit 
tuned was

the only faulty one.

This is clearly a bug in the tuned, and correcting it will allow 
people
to use the hierarch. In the current incarnation 1.3 will 
mostly/always
segfault when hierarch is active. I would prefer not to give a 
broken toy

out there. How about pushing r20267 in the 1.3?

george.


On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:

Thanks for digging into this.  Can you file a bug?  Let's mark it 
for

v1.3.1.

I say 1.3.1 instead of 1.3.0 because this *only* affects 
hierarch, and
since hierarch isn't currently selected by default (you must 
specifically
elevate hierarch's priority to get it to run), there's no danger 
that users

will run into this problem in default runs.

But clearly the problem needs to be fixed, and therefore we need 
a bug

to track it.



On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

I just debugged the Reduce_scatter bug mentioned previously. The 
bug is

unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
...
err = comm->c_coll.coll_reduce (..., module)
...
}


but should be
{
...
err = comm->c_coll.coll_reduce (..., 
comm->c_coll.coll_reduce_module);

...
}

The problem as it is right now is that when using hierarch, only a
subset of the functions is set, e.g. reduce, allreduce, bcast, and 
barrier.

Thus, reduce_scatter is from tuned in most scenarios, and calls the
subsequent functions with the wrong module. Hierarch of course 
does not like

that :-)

Anyway, a quick glance through the tuned code reveals a significant
number of instances where this appears (reduce_scatter, 
allreduce, allgather,
allgatherv). Basic, hierarch and inter seem to do that mostly 
correctly.


Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___

Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread Brad Benton
So, if it looks okay on 1.3...then there should not be anything holding up
the release, right?  Otherwise, George we need to decide on whether or not
this is a blocker, or if we go ahead and release with this as a known issue
and schedule the fix for 1.3.1.  My vote is to go ahead and release, but if
you (or others) think otherwise, let's talk about how best to move forward.
--brad


On Wed, Jan 14, 2009 at 12:04 PM, Edgar Gabriel  wrote:

> I'm already debugging it. The good news is that it only seems to appear
> with the trunk; with 1.3 (after copying the new tuned module over), all the
> tests pass.
>
> Now if somebody can tell me a trick on how to tell mpirun not to kill the
> debugger out from under me, then I could even see where the problem occurs :-)
>
> Thanks
> Edgar
>
>
> George Bosilca wrote:
>
>> All these errors are in the MPI_Finalize, it should not be that hard to
>> find. I'll take a look later this afternoon.
>>
>>  george.
>>
>> On Jan 14, 2009, at 06:41 , Tim Mattox wrote:
>>
>>  Unfortunately, although this fixed some problems when enabling hierarch
>>> coll,
>>> there is still a segfault in two of IU's tests that only shows up when we
>>> set
>>> -mca coll_hierarch_priority 100
>>>
>>> See this MTT summary to see how the failures improved on the trunk,
>>> but that there are still two that segfault even at 1.4a1r20267:
>>> http://www.open-mpi.org/mtt/index.php?do_redir=923
>>>
>>> This link just has the remaining failures:
>>> http://www.open-mpi.org/mtt/index.php?do_redir=922
>>>
>>> So, I'll vote for applying the CMR for 1.3 since it clearly improved
>>> things,
>>> but there is still more to be done to get coll_hierarch ready for regular
>>> use.
>>>
>>> On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca 
>>> wrote:
>>>
 Here we go by the book :)

 https://svn.open-mpi.org/trac/ompi/ticket/1749

 george.

 On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:

  Let's debate tomorrow when people are around, but first you have to
> file a
> CMR... :-)
>
> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>
>  Unfortunately, this pinpoints the fact that we didn't test enough the
>> collective module mixing thing. I went over the tuned collective
>> functions
>> and changed all instances to use the correct module information. It is
>> now
>> on the trunk, revision 20267. Simultaneously, I checked that all other
>> collective components do the right thing ... and I have to admit tuned
>> was
>> the only faulty one.
>>
>> This is clearly a bug in the tuned, and correcting it will allow
>> people
>> to use the hierarch. In the current incarnation 1.3 will mostly/always
>> segfault when hierarch is active. I would prefer not to give a broken
>> toy
>> out there. How about pushing r20267 in the 1.3?
>>
>> george.
>>
>>
>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>
>>  Thanks for digging into this.  Can you file a bug?  Let's mark it for
>>> v1.3.1.
>>>
>>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,
>>> and
>>> since hierarch isn't currently selected by default (you must
>>> specifically
>>> elevate hierarch's priority to get it to run), there's no danger that
>>> users
>>> will run into this problem in default runs.
>>>
>>> But clearly the problem needs to be fixed, and therefore we need a
>>> bug
>>> to track it.
>>>
>>>
>>>
>>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>>
>>>  I just debugged the Reduce_scatter bug mentioned previously. The bug
 is
 unfortunately not in hierarch, but in tuned.

 Here is the code snippet causing the problems:

 int reduce_scatter (..., mca_coll_base_module_t *module)
 {
 ...
 err = comm->c_coll.coll_reduce (..., module)
 ...
 }


 but should be
 {
 ...
 err = comm->c_coll.coll_reduce (...,
 comm->c_coll.coll_reduce_module);
 ...
 }

 The problem as it is right now is that when using hierarch, only a
 subset of the functions is set, e.g. reduce, allreduce, bcast, and
 barrier.
 Thus, reduce_scatter is from tuned in most scenarios, and calls the
 subsequent functions with the wrong module. Hierarch of course does
 not like
 that :-)

 Anyway, a quick glance through the tuned code reveals a significant
 number of instances where this appears (reduce_scatter, allreduce,
 allgather,
 allgatherv). Basic, hierarch and inter seem to do that mostly
 correctly.

 Thanks
 Edgar
 --
 Edgar Gabriel
 Assistant Professor
 Parallel Software Technologies Lab  

Re: [OMPI devel] crcpw verbosity

2009-01-14 Thread Josh Hursey
The crcpw component is in the PML framework. The following should be  
the MCA parameter you are looking for:

pml_crcpw_verbose=20

You can use the 'ompi_info' command to find out more information about  
MCA parameters available. For example to find this one you can use the  
following:

  ompi_info --param pml crcpw
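
To actually set the parameter you can use any of the usual MCA
mechanisms, for example on the mpirun command line or via the
environment (illustrative command lines; adjust to your setup):

  mpirun --mca pml_crcpw_verbose 20 ...
  export OMPI_MCA_pml_crcpw_verbose=20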

Cheers,
Josh

On Jan 14, 2009, at 12:54 PM, Caciano Machado wrote:


Hi,

What variable should I set to increase the verbosity of crcpw  
component?


I've tried "ompi_crcpw_verbose=20" and "crcpw_base_verbose=20". How
can I figure out the name of the variable?

Regards,
Caciano
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] autosizing the shared memory backing file

2009-01-14 Thread Eugene Loh
I think you'd like to know more than just how many procs are local.  
E.g., if the chunk or eager limits are changed much, that would impact 
how much memory you'd like to allocate.


A phone chat is all right for me, though so far all I've heard is that 
no one understands the code!


But, maybe we can nip this one in the bud.  How about the following 
proposal.


First, what's happening is:
*) the sm BTL (which knows how big the file should be)
calls
*) mca_mpool_base_module_create()
calls
*) mca_mpool_sm_init() (which creates the file)

There is no explicit calling argument to transmit an mpool size through 
these function calls, but there is a "resources" argument.  This 
resources argument appears to be opaque to the intervening function, but 
it seems to be understood by both the sm BTL caller and the sm mpool 
component callee.  Other components appear to have other definitions of 
the resources data structure.


So, I propose:

*)  In mca/mpool/sm/mpool_sm.h, there is a definition of 
mca_mpool_base_resources_t.  It has a single field (int32_t mem_node).  
How about I add another field here:  size_t size.


*)  In the sm BTL in sm_btl_first_time_init(), I can set the size of the 
mmap file in my "resources" data structure.


*)  In mca_mpool_sm_init(), when I determine the mmap file size, I just 
look up the resources->size value and use that.


Yes?  Clean and proper solution?  Does not break other BTLs?
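
In code, the proposal would look roughly like this (a minimal sketch of
the idea with simplified declarations -- not the actual OMPI headers;
apart from mca_mpool_base_resources_t, mem_node, and the proposed size
field, the names here are illustrative):

#include <stddef.h>
#include <stdint.h>

/* ompi/mca/mpool/sm/mpool_sm.h (sketch): add the proposed size field. */
typedef struct mca_mpool_base_resources_t {
    int32_t mem_node;   /* existing field */
    size_t  size;       /* proposed: requested size of the sm backing file */
} mca_mpool_base_resources_t;

/* In the sm BTL (e.g. in sm_btl_first_time_init()), the caller would
 * fill in the size it computed from the eager limit, max frag size, and
 * number of local procs before calling mca_mpool_base_module_create(). */
static void set_sm_mpool_resources(mca_mpool_base_resources_t *res,
                                   size_t computed_file_size)
{
    res->mem_node = -1;               /* e.g. no NUMA preference */
    res->size     = computed_file_size;
}

/* mca_mpool_sm_init() would then simply mmap a file of res->size bytes
 * instead of deriving the size on its own. */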

Ralph Castain wrote:

I also know little about that part of the code, but agree that does  
seem weird. Seeing as we know how many local procs there are before 
we  get to this point, I would think we could be smart about our 
memory  pool size. You might not need to dive into the sm BTL to get 
the info  you need - if all you need is how many procs are local, that 
can be  obtained fairly easily.


Be happy to contribute to the chat, if it would be helpful.

On Jan 14, 2009, at 7:43 AM, Jeff Squyres wrote:


Would it be useful to get on the phone and discuss this stuff?

On Jan 14, 2009, at 1:11 AM, Eugene Loh wrote:

Thanks for the reply.  I kind of understand, but it's rather  
weird.  The BTL calls mca_mpool_base_module_create() to create a  
pool of memory, but the BTL has no say how big of a pool to  create?


E.g., I see that there is a "resources" argument  
(mca_mpool_base_resources_t).  Maybe that structure should be  
expanded to include a "size" field?



On Jan 13, 2009, at 19:22 , Eugene Loh wrote:

With the sm BTL, there is a file that each process mmaps in for 
shared memory.


I'm trying to get mpool_sm to size the file appropriately.  
mpool_sm creates and mmaps  the file, but  the size depends on 
parameters like eager limit  and max frag size  that are known by 
the btl_sm.




[OMPI devel] Open MPI v1.3rc5 has been posted

2009-01-14 Thread Tim Mattox
Hi All,
The fifth release candidate of Open MPI v1.3 is now available:

 http://www.open-mpi.org/software/ompi/v1.3/

Please run it through its paces as best you can.
Anticipated release of 1.3 is tonight/tomorrow. (again)
-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/


Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread Edgar Gabriel
I'm already debugging it. The good news is that it only seems to appear 
with the trunk; with 1.3 (after copying the new tuned module over), all the 
tests pass.


Now if somebody can tell me a trick on how to tell mpirun not to kill the 
debugger out from under me, then I could even see where the problem occurs :-)


Thanks
Edgar

George Bosilca wrote:
All these errors are in the MPI_Finalize, it should not be that hard to 
find. I'll take a look later this afternoon.


  george.

On Jan 14, 2009, at 06:41 , Tim Mattox wrote:

Unfortunately, although this fixed some problems when enabling 
hierarch coll,
there is still a segfault in two of IU's tests that only shows up when 
we set

-mca coll_hierarch_priority 100

See this MTT summary to see how the failures improved on the trunk,
but that there are still two that segfault even at 1.4a1r20267:
http://www.open-mpi.org/mtt/index.php?do_redir=923

This link just has the remaining failures:
http://www.open-mpi.org/mtt/index.php?do_redir=922

So, I'll vote for applying the CMR for 1.3 since it clearly improved 
things,

but there is still more to be done to get coll_hierarch ready for regular
use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca 
 wrote:

Here we go by the book :)

https://svn.open-mpi.org/trac/ompi/ticket/1749

george.

On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:

Let's debate tomorrow when people are around, but first you have to 
file a

CMR... :-)

On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:


Unfortunately, this pinpoints the fact that we didn't test enough the
collective module mixing thing. I went over the tuned collective 
functions
and changed all instances to use the correct module information. It 
is now

on the trunk, revision 20267. Simultaneously, I checked that all other
collective components do the right thing ... and I have to admit 
tuned was

the only faulty one.

This is clearly a bug in the tuned, and correcting it will allow 
people

to use the hierarch. In the current incarnation 1.3 will mostly/always
segfault when hierarch is active. I would prefer not to give a 
broken toy

out there. How about pushing r20267 in the 1.3?

george.


On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:


Thanks for digging into this.  Can you file a bug?  Let's mark it for
v1.3.1.

I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, 
and
since hierarch isn't currently selected by default (you must 
specifically
elevate hierarch's priority to get it to run), there's no danger 
that users

will run into this problem in default runs.

But clearly the problem needs to be fixed, and therefore we need a 
bug

to track it.



On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

I just debugged the Reduce_scatter bug mentioned previously. The 
bug is

unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
...
err = comm->c_coll.coll_reduce (..., module)
...
}


but should be
{
...
err = comm->c_coll.coll_reduce (..., 
comm->c_coll.coll_reduce_module);

...
}

The problem as it is right now is that when using hierarch, only a
subset of the functions is set, e.g. reduce, allreduce, bcast, and 
barrier.

Thus, reduce_scatter is from tuned in most scenarios, and calls the
subsequent functions with the wrong module. Hierarch of course 
does not like

that :-)

Anyway, a quick glance through the tuned code reveals a significant
number of instances where this appears (reduce_scatter, allreduce, 
allgather,
allgatherv). Basic, hierarch and inter seem to do that mostly 
correctly.


Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
   I'm a bright... http://www.the-brights.net/
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread George Bosilca
All these errors are in MPI_Finalize; it should not be that hard  
to find. I'll take a look later this afternoon.


  george.

On Jan 14, 2009, at 06:41 , Tim Mattox wrote:

Unfortunately, although this fixed some problems when enabling  
hierarch coll,
there is still a segfault in two of IU's tests that only shows up  
when we set

-mca coll_hierarch_priority 100

See this MTT summary to see how the failures improved on the trunk,
but that there are still two that segfault even at 1.4a1r20267:
http://www.open-mpi.org/mtt/index.php?do_redir=923

This link just has the remaining failures:
http://www.open-mpi.org/mtt/index.php?do_redir=922

So, I'll vote for applying the CMR for 1.3 since it clearly improved  
things,
but there is still more to be done to get coll_hierarch ready for  
regular

use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca  
 wrote:

Here we go by the book :)

https://svn.open-mpi.org/trac/ompi/ticket/1749

george.

On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:

Let's debate tomorrow when people are around, but first you have  
to file a

CMR... :-)

On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:

Unfortunately, this pinpoints the fact that we didn't test enough  
the
collective module mixing thing. I went over the tuned collective  
functions
and changed all instances to use the correct module information.  
It is now
on the trunk, revision 20267. Simultaneously, I checked that all  
other
collective components do the right thing ... and I have to admit  
tuned was

the only faulty one.

This is clearly a bug in the tuned, and correcting it will allow  
people
to use the hierarch. In the current incarnation 1.3 will mostly/ 
always
segfault when hierarch is active. I would prefer not to give a  
broken toy

out there. How about pushing r20267 in the 1.3?

george.


On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:

Thanks for digging into this.  Can you file a bug?  Let's mark  
it for

v1.3.1.

I say 1.3.1 instead of 1.3.0 because this *only* affects  
hierarch, and
since hierarch isn't currently selected by default (you must  
specifically
elevate hierarch's priority to get it to run), there's no danger  
that users

will run into this problem in default runs.

But clearly the problem needs to be fixed, and therefore we need  
a bug

to track it.



On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

I just debugged the Reduce_scatter bug mentioned previously.  
The bug is

unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
...
err = comm->c_coll.coll_reduce (..., module)
...
}


but should be
{
...
err = comm->c_coll.coll_reduce (..., comm- 
>c_coll.coll_reduce_module);

...
}

The problem as it is right now is that when using hierarch, only a
subset of the functions is set, e.g. reduce, allreduce, bcast, and barrier.
Thus, reduce_scatter is from tuned in most scenarios, and calls  
the
subsequent functions with the wrong module. Hierarch of course  
does not like

that :-)

Anyway, a quick glance through the tuned code reveals a  
significant
number of instances where this appears (reduce_scatter,  
allreduce, allgather,
allgatherv). Basic, hierarch and inter seem to do that mostly  
correctly.


Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
   I'm a bright... http://www.the-brights.net/
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




[OMPI devel] crcpw verbosity

2009-01-14 Thread Caciano Machado
Hi,

What variable should I set to increase the verbosity of crcpw component?

I've tried "ompi_crcpw_verbose=20" and "crcpw_base_verbose=20". How
can I figure out the name of the variable?

Regards,
Caciano


Re: [OMPI devel] OpenMPI question

2009-01-14 Thread Jeff Squyres
To follow up for the web archives -- we discussed this more off-list.  
AFAIK, compiling Open MPI -- including its memory registration cache  
-- works fine in 32 bit mode, even on 64 bit platforms (there was some  
confusion between virtual and physical memory addresses and who uses  
what [OMPI *only* sees virtual memory addresses because it's user- 
space code]).




On Jan 13, 2009, at 2:48 PM, Jeff Squyres wrote:


On Jan 13, 2009, at 7:37 AM, Alex A. Granovsky wrote:


Am I correct assuming that OpenMPI memory registration/cache module
is completely broken by design on any 32-bit system allowing
physical address space larger than 4 GB, and especially when
compiled for 32-bit under 64-bit OS (e.g., Linux)?



I'm not sure what you mean -- OMPI 32 bit builds on a 64 bit system  
should be ok...?  Have you found a problem?


--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] -display-map

2009-01-14 Thread Ralph Castain
We -may- be able to do a more formal XML output at some point. The  
problem will be the natural interleaving of stdout/err from the  
various procs due to the async behavior of MPI. Mpirun receives  
fragmented output in the forwarding system, limited by the buffer  
sizes and the amount of data we can read at any one "bite" from the  
pipes connecting us to the procs. So even though the user -thinks-  
they output a single large line of stuff, it may show up at mpirun as  
a series of fragments. Hence, it gets tricky to know how to put  
appropriate XML brackets around it.


Given this input about when you actually want resolved name info, I  
can at least do something about that area. Won't be in 1.3.0, but  
should make 1.3.1.


As for XML-tagged stdout/err: the OMPI community asked me not to turn  
that feature "on" for 1.3.0 as they felt it hasn't been adequately  
tested yet. The code is present, but cannot be activated in 1.3.0.  
However, I believe it is activated on the trunk when you do --xml 
--tagged-output, so perhaps some testing will help us debug and validate  
it adequately for 1.3.1?


Thanks
Ralph


On Jan 14, 2009, at 7:02 AM, Greg Watson wrote:


Ralph,

The only time we use the resolved names is when we get a map, so we  
consider them part of the map output.


If quasi-XML is all that will ever be possible with 1.3, then you  
may as well leave as-is and we will attempt to clean it up in  
Eclipse. It would be nice if a future version of ompi could output  
correct XML (including stdout) as this would vastly simplify the  
parsing we need to do.


Regards,

Greg

On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote:

Hmmm...well, I can't do either for 1.3.0 as it is departing this  
afternoon.


The first option would be very hard to do. I would have to expose  
the display-map option across the code base and check it prior to  
printing anything about resolving node names. I guess I should ask:  
do you only want noderesolve statements when we are displaying the  
map? Right now, I will output them regardless.


The second option could be done. I could check if any "display"  
option has been specified, and output the  root at that time  
(likewise for the end). Anything we output in-between would be  
encapsulated between the two, but that would include any user  
output to stdout and/or stderr - which for 1.3.0 is not in xml.


Any thoughts?

Ralph

PS. Guess I should clarify that I was not striving for true XML  
interaction here, but rather a quasi-XML format that would help you  
to filter the output. I have no problem trying to get to something  
more formally correct, but it could be tricky in some places to  
achieve it due to the inherent async nature of the beast.



On Jan 13, 2009, at 12:17 PM, Greg Watson wrote:


Ralph,

The XML is looking better now, but there is still one problem. To  
be valid, there needs to be only one root element, but currently  
you don't have any (or many). So rather than:














the XML should be:













or:















Would either of these be possible?

Thanks,

Greg

On Dec 8, 2008, at 2:18 PM, Greg Watson wrote:


Ok thanks. I'll test from trunk in future.

Greg

On Dec 8, 2008, at 2:05 PM, Ralph Castain wrote:


Working its way around the CMR process now.

Might be easier in the future if we could test/debug this in the  
trunk, though. Otherwise, the CMR procedure will fall behind and  
a fix might miss a release window.


Anyway, hopefully this one will make the 1.3.0 release cutoff.

Thanks
Ralph

On Dec 8, 2008, at 9:56 AM, Greg Watson wrote:


Hi Ralph,

This is now in 1.3rc2, thanks. However there are a couple of  
problems. Here is what I see:


[Jarrah.watson.ibm.com:58957] resolved="Jarrah.watson.ibm.com">


For some reason each line is prefixed with "[...]", any idea  
why this is? Also the end tag should be "/>" not ">".


Thanks,

Greg

On Nov 24, 2008, at 3:06 PM, Greg Watson wrote:


Great, thanks. I'll take a look once it comes over to 1.3.

Cheers,

Greg

On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote:


Yo Greg

This is in the trunk as of r20032. I'll bring it over to 1.3  
in a few days.


I implemented it as another MCA param  
"orte_show_resolved_nodenames" so you can actually get the  
info as you execute the job, if you want. The xml tag is  
"noderesolve" - let me know if you need any changes.


Ralph


On Oct 22, 2008, at 11:55 AM, Greg Watson wrote:


Ralph,

I guess the issue for us is that we will have to run two  
commands to get the information we need. One to get the  
configuration information, such as 

Re: [OMPI devel] autosizing the shared memory backing file

2009-01-14 Thread Ralph Castain
I also know little about that part of the code, but agree that does  
seem weird. Seeing as we know how many local procs there are before we  
get to this point, I would think we could be smart about our memory  
pool size. You might not need to dive into the sm BTL to get the info  
you need - if all you need is how many procs are local, that can be  
obtained fairly easily.


Be happy to contribute to the chat, if it would be helpful.


On Jan 14, 2009, at 7:43 AM, Jeff Squyres wrote:

Ya, that does seem weird to me, but I never fully grokked the whole  
mpool / allocator scheme (I haven't had to interact with that part  
of the code much).


Would it be useful to get on the phone and discuss this stuff?



On Jan 14, 2009, at 1:11 AM, Eugene Loh wrote:

Thanks for the reply.  I kind of understand, but it's rather  
weird.  The BTL calls mca_mpool_base_module_create() to create a  
pool of memory, but the BTL has no say how big of a pool to  
create?  Could you imagine having a memory allocation routine  
("malloc" or something) that didn't allow you to control the size  
of the allocation?  Instead, the allocation routine determines the  
size.  That's weird.  I must be missing something about how this is  
supposed to work.


E.g., I see that there is a "resources" argument  
(mca_mpool_base_resources_t).  Maybe that structure should be  
expanded to include a "size" field?


Or, maybe I should bypass mca_mpool_base_module_create()/ 
mca_mpool_sm_init() and just call
mca_common_sm_mmap_init() directly, the way mca/coll/sm does  
things.  That would allow me to specify the size of the file.


George Bosilca wrote:

The simple answer is you can't. The mpool is loaded before the  
BTLs and on Linux the loader uses the RTLD_NOW flag (i.e. all  
symbols have  to be defined or the dlopen call will fail).


Moreover, there is no way in Open MPI to exchange information  
between  components except a global variable or something in the  
mca/common. In  other words there is no way for you to call from  
the mpool a function  from the sm BTL.


On Jan 13, 2009, at 19:22 , Eugene Loh wrote:

With the sm BTL, there is a file that each process mmaps in for   
shared memory.


I'm trying to get mpool_sm to size the file appropriately.  So,  
I  would like mpool_sm to call some mca_btl_sm function that  
provides a  good guess of the size.  (mpool_sm creates and mmaps  
the file, but  the size depends on parameters like eager limit  
and max frag size  that are known by the btl_sm.)



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] RFC: Fragmented sm Allocations

2009-01-14 Thread Jeff Squyres
Whoa, this analysis rocks.  :-)  I'm going through trying to grok it  
all...


Just wanted to say: kudos for this.

On Jan 14, 2009, at 1:14 AM, Eugene Loh wrote:



RFC: Fragmented sm Allocations
WHAT: Dealing with the fragmented allocations of sm BTL FIFO  
circular buffers (CB) during MPI_Init().


Also:

• Improve handling of error codes.
• Automate the sizing of the mmap file.
WHY: To reduce consumption of shared memory, making job startup more  
robust, and possibly improving the scalability of startup.


WHERE: In mca_btl_sm_add_procs(), there is a loop over calls to  
ompi_fifo_init(). This is where CBs are initialized one at a time,  
components of a CB allocated individually. Changes can be seen in 
ssh://www.open-mpi.org/~eugene/hg/sm-allocation.


WHEN: Upon acceptance.

TIMEOUT: January 30, 2009.

WHY (details)
The sm BTL establishes a FIFO for each non-self, on-node connection.  
Each FIFO is initialized during MPI_Init() with a circular buffer  
(CB). (More CBs can be added later in program execution if a FIFO  
runs out of room.)


A CB has different components that are used in different ways:

	• The "wrapper" is read by both sender and receiver, but is rarely  
written.
	• The "queue" (FIFO entries) is accessed by both the sender and  
receiver.

• The "head" is accessed by the sender.
• The "tail" is accessed by the receiver.
For performance reasons, a CB is not allocated as one large data  
structure. Rather, these components are laid out separately in  
memory and the wrapper has pointers to the various locations.  
Performance considerations include:


	• false sharing: a component used by one process should not share a  
cacheline with another component that is modified by another process
	• NUMA: certain components should perhaps be mapped preferentially  
to memory pages that are close to the processes that access these  
components
Currently, the sm BTL handles these issues by allocating each  
component of each CB its own page. (Actually, it couples tails and  
queues together.)


As the number of on-node processes grows, however, the shared-memory  
allocation skyrockets. E.g., let's say there are n processes 
on-node. There are therefore n(n-1) = O(n^2) FIFOs, each with 3 
allocations (wrapper, head, and tail/queue). The shared-memory 
allocation for CBs becomes 3n^2 pages. For large n, this dominates 
the shared-memory consumption, even though most of the CB allocation 
is unused. E.g., a 12-byte "head" ends up consuming a full memory 
page!
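
For a rough sense of scale (my own back-of-the-envelope numbers,
assuming 4 KB pages): with n = 64 on-node processes that is 3n^2 =
12,288 separate pages, i.e. roughly 48 MB of shared memory just for CB
components, most of it padding.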


Not only is the 3n^2-page allocation large, but it is also not  
tunable via any MCA parameters.


Large shared-memory consumption has led to some number of start-up 
and other user problems. E.g., the e-mail thread at 
http://www.open-mpi.org/community/lists/devel/2008/11/4882.php.


WHAT (details)
Several actions are recommended here.

1. Cacheline Rather than Pagesize Alignment
The first set of changes reduces the alignment from pagesize to cacheline.  
Though mapping to pages is motivated by NUMA locality, note:


	• The code already has NUMA locality optimizations (maffinity and  
mpools) anyhow.
	• There is no data that I'm aware of substantiating the benefits of  
locality optimizations in this context.
More to the point, I've tried some such experiments myself. I had  
two processes communicating via shared memory on a large SMP that  
had a large difference between remote and local memory access times.  
I timed the roundtrip latency for pingpongs between the processes.  
That time was correlated to the relative separation between the two  
processes, and not at all to the placement of the physical memory  
backing the shared variables. It did not matter if the memory was  
local to the sender or receiver or remote from both! (E.g., colocal  
processes showed fast timings even if the shared memory were remote  
to both processes.)


My results do not prove that all NUMA platforms behave in the same  
way. My point is only that, though I understand the logic behind  
locality optimizations for FIFO placement, the only data I am aware  
of does not substantiate that logic.


The changes are:

• File: ompi/mca/mpool/sm/mpool_sm_module.c
Function: mca_mpool_sm_alloc()
Use the alignment requested by the caller rather than adding  
additional pagesize alignment as well.


• File: ompi/class/ompi_fifo.h
Function: ompi_fifo_init() and ompi_fifo_write_to_head()
Align ompi_cb_fifo_wrapper_t structure on cacheline rather than page.

• File: ompi/class/ompi_circular_buffer_fifo.h
Function: ompi_cb_fifo_init()
Align the two calls to mpool_alloc on cacheline rather than page.

2. Aggregated Allocation
Another option is to lay out all the CBs at once and aggregate their  
allocations.


This may have the added benefit of reducing lock contention during  
MPI_Init(). On the one hand, the 3n^2 CB allocations during  
MPI_Init() contend for a single mca_common_sm_mmap->map_seg->seg_lock 

Re: [OMPI devel] autosizing the shared memory backing file

2009-01-14 Thread Jeff Squyres
Ya, that does seem weird to me, but I never fully grokked the whole  
mpool / allocator scheme (I haven't had to interact with that part of  
the code much).


Would it be useful to get on the phone and discuss this stuff?



On Jan 14, 2009, at 1:11 AM, Eugene Loh wrote:

Thanks for the reply.  I kind of understand, but it's rather weird.   
The BTL calls mca_mpool_base_module_create() to create a pool of  
memory, but the BTL has no say how big of a pool to create?  Could  
you imagine having a memory allocation routine ("malloc" or  
something) that didn't allow you to control the size of the  
allocation?  Instead, the allocation routine determines the size.   
That's weird.  I must be missing something about how this is  
supposed to work.


E.g., I see that there is a "resources" argument  
(mca_mpool_base_resources_t).  Maybe that structure should be  
expanded to include a "size" field?


Or, maybe I should bypass mca_mpool_base_module_create()/ 
mca_mpool_sm_init() and just call
mca_common_sm_mmap_init() directly, the way mca/coll/sm does  
things.  That would allow me to specify the size of the file.


George Bosilca wrote:

The simple answer is you can't. The mpool is loaded before the  
BTLs and on Linux the loader uses the RTLD_NOW flag (i.e. all  
symbols have  to be defined or the dlopen call will fail).


Moreover, there is no way in Open MPI to exchange information  
between  components except a global variable or something in the  
mca/common. In  other words there is no way for you to call from  
the mpool a function  from the sm BTL.


On Jan 13, 2009, at 19:22 , Eugene Loh wrote:

With the sm BTL, there is a file that each process mmaps in for   
shared memory.


I'm trying to get mpool_sm to size the file appropriately.  So, I   
would like mpool_sm to call some mca_btl_sm function that provides  
a  good guess of the size.  (mpool_sm creates and mmaps the file,  
but  the size depends on parameters like eager limit and max frag  
size  that are known by the btl_sm.)



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] -display-map

2009-01-14 Thread Greg Watson

Ralph,

The only time we use the resolved names is when we get a map, so we  
consider them part of the map output.


If quasi-XML is all that will ever be possible with 1.3, then you may  
as well leave as-is and we will attempt to clean it up in Eclipse. It  
would be nice if a future version of ompi could output correct XML  
(including stdout) as this would vastly simplify the parsing we need  
to do.


Regards,

Greg

On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote:

Hmmm...well, I can't do either for 1.3.0 as it is departing this  
afternoon.


The first option would be very hard to do. I would have to expose  
the display-map option across the code base and check it prior to  
printing anything about resolving node names. I guess I should ask:  
do you only want noderesolve statements when we are displaying the  
map? Right now, I will output them regardless.


The second option could be done. I could check if any "display"  
option has been specified, and output the  root at that time  
(likewise for the end). Anything we output in-between would be  
encapsulated between the two, but that would include any user output  
to stdout and/or stderr - which for 1.3.0 is not in xml.


Any thoughts?

Ralph

PS. Guess I should clarify that I was not striving for true XML  
interaction here, but rather a quasi-XML format that would help you  
to filter the output. I have no problem trying to get to something  
more formally correct, but it could be tricky in some places to  
achieve it due to the inherent async nature of the beast.



On Jan 13, 2009, at 12:17 PM, Greg Watson wrote:


Ralph,

The XML is looking better now, but there is still one problem. To  
be valid, there needs to be only one root element, but currently  
you don't have any (or many). So rather than:














the XML should be:













or:















Would either of these be possible?

Thanks,

Greg

On Dec 8, 2008, at 2:18 PM, Greg Watson wrote:


Ok thanks. I'll test from trunk in future.

Greg

On Dec 8, 2008, at 2:05 PM, Ralph Castain wrote:


Working its way around the CMR process now.

Might be easier in the future if we could test/debug this in the  
trunk, though. Otherwise, the CMR procedure will fall behind and  
a fix might miss a release window.


Anyway, hopefully this one will make the 1.3.0 release cutoff.

Thanks
Ralph

On Dec 8, 2008, at 9:56 AM, Greg Watson wrote:


Hi Ralph,

This is now in 1.3rc2, thanks. However there are a couple of  
problems. Here is what I see:


[Jarrah.watson.ibm.com:58957] resolved="Jarrah.watson.ibm.com">


For some reason each line is prefixed with "[...]", any idea why  
this is? Also the end tag should be "/>" not ">".


Thanks,

Greg

On Nov 24, 2008, at 3:06 PM, Greg Watson wrote:


Great, thanks. I'll take a look once it comes over to 1.3.

Cheers,

Greg

On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote:


Yo Greg

This is in the trunk as of r20032. I'll bring it over to 1.3  
in a few days.


I implemented it as another MCA param  
"orte_show_resolved_nodenames" so you can actually get the  
info as you execute the job, if you want. The xml tag is  
"noderesolve" - let me know if you need any changes.


Ralph


On Oct 22, 2008, at 11:55 AM, Greg Watson wrote:


Ralph,

I guess the issue for us is that we will have to run two  
commands to get the information we need. One to get the  
configuration information, such as version and MCA  
parameters, and one to get the host information, whereas it  
would seem more logical that this should all be available via  
some kind of "configuration discovery" command. I understand  
the issue with supplying the hostfile though, so maybe this  
just points at the need for us to separate configuration  
information from the host information. In any case, we'll  
work with what you think is best.


Greg

On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:

Hmmm...just to be sure we are all clear on this. The reason  
we proposed to use mpirun is that "hostfile" has no meaning  
outside of mpirun. That's why ompi_info can't do anything in  
this regard.


We have no idea what hostfile the user may specify until we  
actually get the mpirun cmd line. They may have specified a  
default-hostfile, but they could also specify hostfiles for  
the individual app_contexts. These may or may not include  
the node upon which mpirun is executing.


So the only way to provide you with a separate command to  
get a hostfile<->nodename mapping would require you to  
provide us with the default-hostfile and/or hostfile cmd  
line options just as if you were issuing 

Re: [OMPI devel] OpenMPI Performance Problem with Open|SpeedShop

2009-01-14 Thread Ralph Castain
If your timer is actually generating an interrupt to the process, then  
that could be the source of the problem. I believe the event library  
also treats interrupts as events, and assigns them the highest  
priority. So every one of your interrupts would cause the event  
library to stop what it was doing and go into its interrupt handling  
routine.


I'm no expert on the event library though - just speculating that this  
could be the source of the problem.
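
As a rough illustration of the kind of interaction being speculated
about here -- this is generic POSIX behavior, not Open MPI or libevent
code -- a high-frequency per-process timer keeps interrupting any
blocking poll(), so a poll()-based progress loop wakes up and re-runs
its handling path on every tick. A real sampling profiler would
typically use ITIMER_PROF/SIGPROF; ITIMER_REAL is used below only so
the effect shows up while the process sleeps:

  #include <errno.h>
  #include <poll.h>
  #include <signal.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/time.h>

  static volatile sig_atomic_t ticks = 0;
  static void on_tick(int sig) { (void)sig; ticks++; }

  int main(void)
  {
      struct sigaction sa;
      memset(&sa, 0, sizeof(sa));
      sa.sa_handler = on_tick;                 /* note: no SA_RESTART */
      sigemptyset(&sa.sa_mask);
      sigaction(SIGALRM, &sa, NULL);

      struct itimerval it = { { 0, 1000 }, { 0, 1000 } };   /* ~1000/sec */
      setitimer(ITIMER_REAL, &it, NULL);

      int interrupted = 0;
      for (int i = 0; i < 2000; i++) {
          /* stand-in for the event library's blocking poll() call */
          if (poll(NULL, 0, 10) < 0 && errno == EINTR)
              interrupted++;                   /* woke early; loop re-runs */
      }
      printf("%d of 2000 polls were cut short by %d timer ticks\n",
             interrupted, (int)ticks);
      return 0;
  }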


Ralph

On Jan 13, 2009, at 8:18 PM, William Hachfeld wrote:



Jeff & George,


> Hum; interesting.  I can't think of any reason why that would be a
> problem offhand.  The mca_btl_sm_component_progress() function is the
> shared memory progression function.  opal_progress() and
> mca_bml_r2_progress() are likely mainly dispatching off to this
> function.
>
> Does OSS interfere with shared memory between processes in any way?
> (I'm not enough of a kernel guy to know what the ramifications of
> ptrace and whatnot are)


Open|SS shouldn't interfere with shared memory. We use the pthread  
library to access some TLS, but no shared memory...



> There might be one reason to slow down the application quite a bit.
> If the fact that you're using a timer interacts with libevent (the
> library we're using to internally manage any kind of events), then we
> might end up in the situation where we call poll for every iteration
> in the event library. And this is really expensive.

I did contemplate the notion that maybe we were getting into the  
"progress monitoring" part of OpenMPI every time the timer  
interrupted the process (1000s of times per second). Can either of  
you see any mechanism by which that might happen?



> A quick way to figure out if this is the case is to run Open MPI
> without support for shared memory (--mca btl ^sm). This way we will
> call poll on a regular basis anyway, and if there is no difference
> between a normal run and an OSS one, we know at least where to start
> looking ...

I ran SMG2000 on an 8-CPU Yellowrail node in the two configurations  
and recorded the wall/cpu clock times as reported by SMG2000 itself:


"mpirun -np 8 smg2000 -n 32 64 64"

Struct Interface, wall clock time = 0.042348 seconds
Struct Interface, cpu clock time = 0.04 seconds
SMG Setup, wall clock time = 0.732441 seconds
SMG Setup, cpu clock time = 0.73 seconds
SMG Solve, wall clock time = 6.881814 seconds
SMG Solve, cpu clock time = 6.88 seconds

"mpirun --mca btl ^sm -np 8 smg2000 -n 64 64 64"

Struct Interface, wall clock time = 0.059137 seconds
Struct Interface, cpu clock time = 0.06 seconds
SMG Setup, wall clock time = 0.931437 seconds
SMG Setup, cpu clock time = 0.93 seconds
SMG Solve, wall clock time = 9.107343 seconds
SMG Solve, cpu clock time = 9.11 seconds

But running the application with the "--mca btl ^sm" option inside  
Open|SS also results in an extreme slowdown. I.e. it doesn't make  
any difference whether the shared memory transport is enabled or  
not. Open|SS reports time spent as follows (in case this helps  
pinpoint what is going on inside OpenMPI):


Exclusive CPU time in seconds.  Function (defining location)

364.05   btl_openib_component_progress (libmpi.so.0)
165.89   mthca_poll_cq (libmthca-rdmav2.so)
122.09   pthread_spin_lock (libpthread.so.0)
 90.79   opal_progress (libopen-pal.so.0)
 48.23   mca_bml_r2_progress (libmpi.so.0)
 30.88   ompi_request_wait_all (libmpi.so.0)
  9.78   pthread_spin_unlock (libpthread.so.0)
  4.91   mthca_free_srq_wqe (libmthca-rdmav2.so)
  4.91   mthca_unlock_cqs (libmthca-rdmav2.so)
  4.73   mthca_lock_cqs (libmthca-rdmav2.so)
  0.89   __poll (libc.so.6)
...

Does this help at all?


-- Bill Hachfeld, The Open|SpeedShop Project






___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] RFC: Fragmented sm Allocations

2009-01-14 Thread Ralph Castain
I haven't reviewed the code either, but really do appreciate someone  
taking the time for such a thorough analysis of the problems we have  
all observed for some time!


Thanks Eugene!!


On Jan 14, 2009, at 5:05 AM, Tim Mattox wrote:


Great analysis and suggested changes!  I've not had a chance yet
to look at your hg branch, so this isn't a code review...  Barring a
bad code review, I'd say these changes should all go in the trunk
for inclusion in 1.4.

2009/1/14 Eugene Loh :



RFC: Fragmented sm Allocations

WHAT: Dealing with the fragmented allocations of sm BTL FIFO circular
buffers (CB) during MPI_Init().

Also:

Improve handling of error codes.
Automate the sizing of the mmap file.

WHY: To reduce consumption of shared memory, making job startup  
more robust,

and possibly improving the scalability of startup.

WHERE: In mca_btl_sm_add_procs(), there is a loop over calls to
ompi_fifo_init(). This is where CBs are initialized one at a time,
components of a CB allocated individually. Changes can be seen in
ssh://www.open-mpi.org/~eugene/hg/sm-allocation.

WHEN: Upon acceptance.

TIMEOUT: January 30, 2009.



WHY (details)

The sm BTL establishes a FIFO for each non-self, on-node  
connection. Each
FIFO is initialized during MPI_Init() with a circular buffer (CB).  
(More CBs

can be added later in program execution if a FIFO runs out of room.)

A CB has different components that are used in different ways:

The "wrapper" is read by both sender and receiver, but is rarely  
written.
The "queue" (FIFO entries) is accessed by both the sender and  
receiver.

The "head" is accessed by the sender.
The "tail" is accessed by the receiver.

For performance reasons, a CB is not allocated as one large data  
structure.
Rather, these components are laid out separately in memory and the  
wrapper
has pointers to the various locations. Performance considerations  
include:


false sharing: a component used by one process should not share a  
cacheline

with another component that is modified by another process
NUMA: certain components should perhaps be mapped preferentially to  
memory

pages that are close to the processes that access these components

Currently, the sm BTL handles these issues by allocating each  
component of
each CB its own page. (Actually, it couples tails and queues  
together.)


As the number of on-node processes grows, however, the shared-memory
allocation skyrockets. E.g., let's say there are n processes on-node.
There are therefore n(n-1) = O(n^2) FIFOs, each with 3 allocations
(wrapper, head, and tail/queue). The shared-memory allocation for CBs
becomes 3n^2 pages. For large n, this dominates the shared-memory
consumption, even though most of the CB allocation is unused. E.g., a
12-byte "head" ends up consuming a full memory page!
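
(To put numbers on that: assuming 4 KiB pages, n = 256 on-node processes
gives 3 * 256^2 = 196,608 pages, i.e. roughly 768 MiB of shared memory
for CB components alone.)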

Not only is the 3n^2-page allocation large, but it is also not
tunable via any MCA parameters.

Large shared-memory consumption has led to some number of start-up  
and other

user problems. E.g., the e-mail thread at
http://www.open-mpi.org/community/lists/devel/2008/11/4882.php.

WHAT (details)

Several actions are recommended here.

1. Cacheline Rather than Pagesize Alignment

The first set of changes reduces pagesize to cacheline alignment.  
Though

mapping to pages is motivated by NUMA locality, note:

The code already has NUMA locality optimizations (maffinity and  
mpools)

anyhow.
There is no data that I'm aware of substantiating the benefits of  
locality

optimizations in this context.

More to the point, I've tried some such experiments myself. I had two
processes communicating via shared memory on a large SMP that had a  
large

difference between remote and local memory access times. I timed the
roundtrip latency for pingpongs between the processes. That time was
correlated to the relative separation between the two processes,  
and not at
all to the placement of the physical memory backing the shared  
variables. It
did not matter if the memory was local to the sender or receiver or  
remote
from both! (E.g., colocal processes showed fast timings even if the  
shared

memory were remote to both processes.)

My results do not prove that all NUMA platforms behave in the same  
way. My

point is only that, though I understand the logic behind locality
optimizations for FIFO placement, the only data I am aware of does  
not

substantiate that logic.

The changes are:

File: ompi/mca/mpool/sm/mpool_sm_module.c
Function: mca_mpool_sm_alloc()

Use the alignment requested by the caller rather than adding  
additional

pagesize alignment as well.

File: ompi/class/ompi_fifo.h
Function: ompi_fifo_init() and ompi_fifo_write_to_head()

Align ompi_cb_fifo_wrapper_t structure on cacheline rather than page.

File: ompi/class/ompi_circular_buffer_fifo.h
Function: ompi_cb_fifo_init()

Align the two calls to mpool_alloc on cacheline rather than page.

2. Aggregated Allocation

Another option is 

Re: [OMPI devel] OpenMPI rpm build 1.3rc3r20226 build failed

2009-01-14 Thread Jeff Squyres
Is there some code that can be fixed instead?  I.e., is this feature  
totally incompatible with whatever RPM compiler flags are used, or is  
it just some coding style that these particular flags don't like?



On Jan 14, 2009, at 5:05 AM, Matthias Jurenz wrote:


Another workaround should be to disable the I/O tracing feature of VT
by adding the configure option

'--with-contrib-vt-flags=--disable-iotrace'

That will have the effect that the upcoming OMPI-rpm's have no support
for I/O tracing, but in our opinion it is not so bad...

Furthermore, we could add the configure option in
'ompi/contrib/vt/configure.m4' to retain the feature-consistency  
between

the rpm's and the source packages.


Matthias

On Tue, 2009-01-13 at 17:13 +0200, Lenny Verkhovsky wrote:
I don't want to move changes (the default value of the flag), since
there are important people for whom it works :)
I also think that this is a VT issue, but I guess we are the only ones
who experience the errors.

We can now override these params from the environment as a workaround;
Mike committed a buildrpm.sh script to the trunk (r20253) that allows
overriding params from the environment.

We observed the problem on CentOS 5.2 with the bundled gcc and on
RedHat 5.2 with the bundled gcc.

#uname -a
Linux elfit1 2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64
x86_64 x86_64 GNU/Linux

#lsb_release -a
LSB Version:
:core-3.1-amd64:core-3.1-ia32:core-3.1-noarch:graphics-3.1-amd64:graphics-3.1-ia32:graphics-3.1-noarch

Distributor ID: CentOS
Description:CentOS release 5.2 (Final)
Release:5.2
Codename:   Final

gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)

Best regards,
Lenny.


On Tue, Jan 13, 2009 at 4:40 PM, Jeff Squyres   
wrote:
I'm still guessing that this is a distro / compiler issue -- I can  
build

with the default flags just fine...?

Can you specify what distro / compiler you were using?

Also, if you want to move the changes that have been made to  
buildrpm.sh to
the v1.3 branch, just file a CMR.  That file is not included in  
release

tarballs, so Tim can move it over at any time.



On Jan 13, 2009, at 6:35 AM, Lenny Verkhovsky wrote:

it seems that setting use_default_rpm_opt_flags to 0 solves the  
problem.

Maybe vt developers should take a look on it.

Lenny.


On Sun, Jan 11, 2009 at 2:40 PM, Jeff Squyres  
 wrote:


This sounds like a distro/compiler version issue.

Can you narrow down the issue at all?


On Jan 11, 2009, at 3:23 AM, Lenny Verkhovsky wrote:


it doesn't happen if I do autogen, configure and make install,
only when I try to make an rpm from the tar file.



On Thu, Jan 8, 2009 at 9:43 PM, Jeff Squyres  
 wrote:


This doesn't happen in a normal build of the same tree?

I ask because both 1.3r20226 builds fine for me manually (i.e.,
./configure;make and buildrpm.sh).


On Jan 8, 2009, at 8:15 AM, Lenny Verkhovsky wrote:


Hi,

I am trying to build an rpm from nightly snapshots of 1.3

with the downloaded buildrpm.sh and ompi.spec file from
http://svn.open-mpi.org/svn/ompi/branches/v1.3/contrib/dist/linux/

I am getting this error
.
Making all in vtlib
make[5]: Entering directory

`/hpc/home/USERS/lennyb/work/svn/release/scripts/dist-1.3--1/ 
OMPI/BUILD/

openmpi-1.3rc3r20226/ompi/contrib/vt/vt/vtlib'
gcc  -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib
-I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE
-DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\"
-DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG
-DVT_MEMHOOK -DVT_IOWRAP  -O2 -g -pipe -Wall -Wp,- 
D_FORTIFY_SOURCE=2

-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic -MT vt_comp_gnu.o -MD -MP -MF .deps/ 
vt_comp_gnu.Tpo -c

-o
vt_comp_gnu.o vt_comp_gnu.c
gcc  -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib
-I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE
-DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\"
-DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG
-DVT_MEMHOOK -DVT_IOWRAP  -O2 -g -pipe -Wall -Wp,- 
D_FORTIFY_SOURCE=2

-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic -MT vt_memhook.o -MD -MP -MF .deps/ 
vt_memhook.Tpo -c -o

vt_memhook.o vt_memhook.c
gcc  -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib
-I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE
-DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\"
-DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG
-DVT_MEMHOOK -DVT_IOWRAP  -O2 -g -pipe -Wall -Wp,- 
D_FORTIFY_SOURCE=2

-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic -MT vt_memreg.o -MD -MP -MF .deps/ 
vt_memreg.Tpo -c -o

vt_memreg.o vt_memreg.c
gcc  -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib
-I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE
-DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\"
-DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG
-DVT_MEMHOOK -DVT_IOWRAP  -O2 -g 

Re: [OMPI devel] RFC: Fragmented sm Allocations

2009-01-14 Thread Tim Mattox
Great analysis and suggested changes!  I've not had a chance yet
to look at your hg branch, so this isn't a code review...  Barring a
bad code review, I'd say these changes should all go in the trunk
for inclusion in 1.4.

2009/1/14 Eugene Loh :
>
>
> RFC: Fragmented sm Allocations
>
> WHAT: Dealing with the fragmented allocations of sm BTL FIFO circular
> buffers (CB) during MPI_Init().
>
> Also:
>
> Improve handling of error codes.
> Automate the sizing of the mmap file.
>
> WHY: To reduce consumption of shared memory, making job startup more robust,
> and possibly improving the scalability of startup.
>
> WHERE: In mca_btl_sm_add_procs(), there is a loop over calls to
> ompi_fifo_init(). This is where CBs are initialized one at a time,
> components of a CB allocated individually. Changes can be seen in
> ssh://www.open-mpi.org/~eugene/hg/sm-allocation.
>
> WHEN: Upon acceptance.
>
> TIMEOUT: January 30, 2009.
>
> 
>
> WHY (details)
>
> The sm BTL establishes a FIFO for each non-self, on-node connection. Each
> FIFO is initialized during MPI_Init() with a circular buffer (CB). (More CBs
> can be added later in program execution if a FIFO runs out of room.)
>
> A CB has different components that are used in different ways:
>
> The "wrapper" is read by both sender and receiver, but is rarely written.
> The "queue" (FIFO entries) is accessed by both the sender and receiver.
> The "head" is accessed by the sender.
> The "tail" is accessed by the receiver.
>
> For performance reasons, a CB is not allocated as one large data structure.
> Rather, these components are laid out separately in memory and the wrapper
> has pointers to the various locations. Performance considerations include:
>
> false sharing: a component used by one process should not share a cacheline
> with another component that is modified by another process
> NUMA: certain components should perhaps be mapped preferentially to memory
> pages that are close to the processes that access these components
>
> Currently, the sm BTL handles these issues by allocating each component of
> each CB its own page. (Actually, it couples tails and queues together.)
>
> As the number of on-node processes grows, however, the shared-memory
> allocation skyrockets. E.g., let's say there are n processes on-node. There
> are therefore n(n-1) = O(n^2) FIFOs, each with 3 allocations (wrapper, head,
> and tail/queue). The shared-memory allocation for CBs becomes 3n^2 pages. For
> large n, this dominates the shared-memory consumption, even though most of
> the CB allocation is unused. E.g., a 12-byte "head" ends up consuming a full
> memory page!
>
> Not only is the 3n^2-page allocation large, but it is also not tunable via
> any MCA parameters.
>
> Large shared-memory consumption has led to some number of start-up and other
> user problems. E.g., the e-mail thread at
> http://www.open-mpi.org/community/lists/devel/2008/11/4882.php.
>
> WHAT (details)
>
> Several actions are recommended here.
>
> 1. Cacheline Rather than Pagesize Alignment
>
> The first set of changes reduces pagesize to cacheline alignment. Though
> mapping to pages is motivated by NUMA locality, note:
>
> The code already has NUMA locality optimizations (maffinity and mpools)
> anyhow.
> There is no data that I'm aware of substantiating the benefits of locality
> optimizations in this context.
>
> More to the point, I've tried some such experiments myself. I had two
> processes communicating via shared memory on a large SMP that had a large
> difference between remote and local memory access times. I timed the
> roundtrip latency for pingpongs between the processes. That time was
> correlated to the relative separation between the two processes, and not at
> all to the placement of the physical memory backing the shared variables. It
> did not matter if the memory was local to the sender or receiver or remote
> from both! (E.g., colocal processes showed fast timings even if the shared
> memory were remote to both processes.)
>
> My results do not prove that all NUMA platforms behave in the same way. My
> point is only that, though I understand the logic behind locality
> optimizations for FIFO placement, the only data I am aware of does not
> substantiate that logic.
>
> The changes are:
>
> File: ompi/mca/mpool/sm/mpool_sm_module.c
> Function: mca_mpool_sm_alloc()
>
> Use the alignment requested by the caller rather than adding additional
> pagesize alignment as well.
>
> File: ompi/class/ompi_fifo.h
> Function: ompi_fifo_init() and ompi_fifo_write_to_head()
>
> Align ompi_cb_fifo_wrapper_t structure on cacheline rather than page.
>
> File: ompi/class/ompi_circular_buffer_fifo.h
> Function: ompi_cb_fifo_init()
>
> Align the two calls to mpool_alloc on cacheline rather than page.
>
> 2. Aggregated Allocation
>
> Another option is to lay out all the CBs at once and aggregate their
> allocations.
>
> This may have the added benefit of reducing lock 

Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread Tim Mattox
Unfortunately, although this fixed some problems when enabling hierarch coll,
there is still a segfault in two of IU's tests that only shows up when we set
-mca coll_hierarch_priority 100

See this MTT summary to see how the failures improved on the trunk,
but that there are still two that segfault even at 1.4a1r20267:
http://www.open-mpi.org/mtt/index.php?do_redir=923

This link just has the remaining failures:
http://www.open-mpi.org/mtt/index.php?do_redir=922

So, I'll vote for applying the CMR for 1.3 since it clearly improved things,
but there is still more to be done to get coll_hierarch ready for regular
use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca  wrote:
> Here we go by the book :)
>
> https://svn.open-mpi.org/trac/ompi/ticket/1749
>
>  george.
>
> On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:
>
>> Let's debate tomorrow when people are around, but first you have to file a
>> CMR... :-)
>>
>> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>>
>>> Unfortunately, this pinpoints the fact that we didn't test enough the
>>> collective module mixing thing. I went over the tuned collective functions
>>> and changed all instances to use the correct module information. It is now
>>> on the trunk, revision 20267. Simultaneously, I checked that all other
>>> collective components do the right thing ... and I have to admit tuned was
>>> the only faulty one.
>>>
>>> This is clearly a bug in the tuned, and correcting it will allow people
>>> to use the hierarch. In the current incarnation 1.3 will mostly/always
>>> segfault when hierarch is active. I would prefer not to give a broken toy
>>> out there. How about pushing r20267 in the 1.3?
>>>
>>> george.
>>>
>>>
>>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>>
 Thanks for digging into this.  Can you file a bug?  Let's mark it for
 v1.3.1.

 I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and
 since hierarch isn't currently selected by default (you must specifically
 elevate hierarch's priority to get it to run), there's no danger that users
 will run into this problem in default runs.

 But clearly the problem needs to be fixed, and therefore we need a bug
 to track it.



 On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

> I just debugged the Reduce_scatter bug mentioned previously. The bug is
> unfortunately not in hierarch, but in tuned.
>
> Here is the code snippet causing the problems:
>
> int reduce_scatter (..., mca_coll_base_module_t *module)
> {
> ...
> err = comm->c_coll.coll_reduce (..., module)
> ...
> }
>
>
> but should be
> {
> ...
> err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
> ...
> }
>
> The problem as it is right now is that, when using hierarch, only a
> subset of the functions is set, e.g. reduce, allreduce, bcast and barrier.
> Thus, reduce_scatter is from tuned in most scenarios, and calls the
> subsequent functions with the wrong module. Hierarch of course does not 
> like
> that :-)
>
> Anyway, a quick glance through the tuned code reveals a significant
> number of instances where this appears(reduce_scatter, allreduce, 
> allgather,
> allgatherv). Basic, hierarch and inter seem to do that mostly correctly.
>
> Thanks
> Edgar
> --
> Edgar Gabriel
> Assistant Professor
> Parallel Software Technologies Lab  http://pstl.cs.uh.edu
> Department of Computer Science  University of Houston
> Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
> Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


 --
 Jeff Squyres
 Cisco Systems

 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/


Re: [OMPI devel] OpenMPI rpm build 1.3rc3r20226 build failed

2009-01-14 Thread Matthias Jurenz
Another workaround should be to disable the I/O tracing feature of VT
by adding the configure option

'--with-contrib-vt-flags=--disable-iotrace'

That will have the effect that the upcoming OMPI-rpm's have no support
for I/O tracing, but in our opinion it is not so bad...

Furthermore, we could add the configure option in
'ompi/contrib/vt/configure.m4' to retain the feature-consistency between
the rpm's and the source packages.


Matthias

On Tue, 2009-01-13 at 17:13 +0200, Lenny Verkhovsky wrote:
> I don't want to move changes (the default value of the flag), since there
> are important people for whom it works :)
> I also think that this is a VT issue, but I guess we are the only ones
> who experience the errors.
> 
> we can now override these params from the environment as a workaround;
> Mike committed a buildrpm.sh script to the trunk (r20253) that allows
> overriding params from the environment.
> 
> we observed the problem on CentOS 5.2 with the bundled gcc and on RedHat 5.2
> with the bundled gcc.
> 
> #uname -a
> Linux elfit1 2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64
> x86_64 x86_64 GNU/Linux
> 
> #lsb_release -a
> LSB Version:
> :core-3.1-amd64:core-3.1-ia32:core-3.1-noarch:graphics-3.1-amd64:graphics-3.1-ia32:graphics-3.1-noarch
> Distributor ID: CentOS
> Description:CentOS release 5.2 (Final)
> Release:5.2
> Codename:   Final
> 
> gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)
> 
> Best regards,
> Lenny.
> 
> 
> On Tue, Jan 13, 2009 at 4:40 PM, Jeff Squyres  wrote:
> > I'm still guessing that this is a distro / compiler issue -- I can build
> > with the default flags just fine...?
> >
> > Can you specify what distro / compiler you were using?
> >
> > Also, if you want to move the changes that have been made to buildrpm.sh to
> > the v1.3 branch, just file a CMR.  That file is not included in release
> > tarballs, so Tim can move it over at any time.
> >
> >
> >
> > On Jan 13, 2009, at 6:35 AM, Lenny Verkhovsky wrote:
> >
> >> it seems that setting use_default_rpm_opt_flags to 0 solves the problem.
> >> Maybe vt developers should take a look on it.
> >>
> >> Lenny.
> >>
> >>
> >> On Sun, Jan 11, 2009 at 2:40 PM, Jeff Squyres  wrote:
> >>>
> >>> This sounds like a distro/compiler version issue.
> >>>
> >>> Can you narrow down the issue at all?
> >>>
> >>>
> >>> On Jan 11, 2009, at 3:23 AM, Lenny Verkhovsky wrote:
> >>>
>  it doesn't happen if I do autogen, configure and make install,
>  only when I try to make an rpm from the tar file.
> 
> 
> 
>  On Thu, Jan 8, 2009 at 9:43 PM, Jeff Squyres  wrote:
> >
> > This doesn't happen in a normal build of the same tree?
> >
> > I ask because both 1.3r20226 builds fine for me manually (i.e.,
> > ./configure;make and buildrpm.sh).
> >
> >
> > On Jan 8, 2009, at 8:15 AM, Lenny Verkhovsky wrote:
> >
> >> Hi,
> >>
> >> I am trying to build an rpm from nightly snapshots of 1.3
> >>
> >> with the downloaded buildrpm.sh and ompi.spec file from
> >> http://svn.open-mpi.org/svn/ompi/branches/v1.3/contrib/dist/linux/
> >>
> >> I am getting this error
> >> .
> >> Making all in vtlib
> >> make[5]: Entering directory
> >>
> >> `/hpc/home/USERS/lennyb/work/svn/release/scripts/dist-1.3--1/OMPI/BUILD/
> >> openmpi-1.3rc3r20226/ompi/contrib/vt/vt/vtlib'
> >> gcc  -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib
> >> -I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE
> >> -DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\"
> >> -DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG
> >> -DVT_MEMHOOK -DVT_IOWRAP  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> >> -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> >> -mtune=generic -MT vt_comp_gnu.o -MD -MP -MF .deps/vt_comp_gnu.Tpo -c
> >> -o
> >> vt_comp_gnu.o vt_comp_gnu.c
> >> gcc  -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib
> >> -I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE
> >> -DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\"
> >> -DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG
> >> -DVT_MEMHOOK -DVT_IOWRAP  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> >> -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> >> -mtune=generic -MT vt_memhook.o -MD -MP -MF .deps/vt_memhook.Tpo -c -o
> >> vt_memhook.o vt_memhook.c
> >> gcc  -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib
> >> -I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE
> >> -DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\"
> >> -DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG
> >> -DVT_MEMHOOK -DVT_IOWRAP  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> >> -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> >> -mtune=generic -MT vt_memreg.o -MD -MP -MF 

[OMPI devel] RFC: Fragmented sm Allocations

2009-01-14 Thread Eugene Loh


RFC: Fragmented sm Allocations


WHAT:  Dealing with the fragmented allocations of sm BTL FIFO
circular buffers (CB) during MPI_Init().


Also:

 Improve handling of error codes.
 Automate the sizing of the mmap file.



WHY:  To reduce consumption of shared memory, making job startup
more robust, and possibly improving the scalability of startup.


WHERE: In mca_btl_sm_add_procs(), there is a loop
over calls to ompi_fifo_init().  This is where CBs are
initialized one at a time, components of a CB allocated individually.
Changes can be seen in
ssh://www.open-mpi.org/~eugene/hg/sm-allocation.


WHEN:  Upon acceptance.


TIMEOUT:  January 30, 2009.



WHY (details)


The sm BTL establishes a FIFO for each non-self, on-node
connection.  Each FIFO is initialized during MPI_Init()
with a circular buffer (CB).  (More CBs can be added later in
program execution if a FIFO runs out of room.)


A CB has different components that are used in different ways:

 The "wrapper" is read by both sender and receiver,
 but is rarely written.
 The "queue" (FIFO entries) is accessed by both the
 sender and receiver.
 The "head" is accessed by the sender.
 The "tail" is accessed by the receiver.



For performance reasons, a CB is not allocated as one large data structure.
Rather, these components are laid out separately in memory and the
wrapper has pointers to the various locations.  Performance
considerations include:

 false sharing: a component used by one process should
 not share a cacheline with another component that is
 modified by another process
 NUMA: certain components should perhaps be mapped
 preferentially to memory pages that are close to the
 processes that access these components
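
A minimal sketch of the false-sharing point (not the Open MPI layout;
the 64-byte cacheline is an assumption): keep the sender-written
"head" and the receiver-written "tail" on separate cachelines so that
neither process invalidates the line the other one is polling.

  #define CACHELINE 64
  struct example_fifo_ctrl {
      volatile int head;                       /* written by the sender   */
      char pad1[CACHELINE - sizeof(int)];
      volatile int tail;                       /* written by the receiver */
      char pad2[CACHELINE - sizeof(int)];
  };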



Currently, the sm BTL handles these issues by allocating
each component of each CB its own page.  (Actually, it couples
tails and queues together.)


As the number of on-node processes grows, however, the shared-memory
allocation skyrockets.  E.g., let's say there are n processes
on-node.  There are therefore n(n-1) = O(n^2)
FIFOs, each with 3 allocations (wrapper, head, and tail/queue).  The
shared-memory allocation for CBs becomes 3n^2 pages.
For large n, this dominates the shared-memory consumption,
even though most of the CB allocation is unused.  E.g., a 12-byte
"head" ends up consuming a full memory page!


Not only is the 3n^2-page allocation large, but it
is also not tunable via any MCA parameters.


Large shared-memory consumption has led to some number of
start-up and other user problems.  E.g., the e-mail thread at
http://www.open-mpi.org/community/lists/devel/2008/11/4882.php.

WHAT (details)


Several actions are recommended here.

1.  Cacheline Rather than Pagesize Alignment


The first set of changes reduces pagesize to cacheline alignment.
Though mapping to pages is motivated by NUMA locality, note:

 The code already has NUMA locality optimizations
 (maffinity and mpools) anyhow.
 There is no data that I'm aware of substantiating the
 benefits of locality optimizations in this context.
 
 
 More to the point, I've tried some such experiments myself.
 I had two processes communicating via shared memory on a
 large SMP that had a large difference between remote and local
 memory access times.  I timed the roundtrip latency for
 pingpongs between the processes.  That time was correlated
 to the relative separation between the two processes, and
 not at all to the placement of the physical memory backing
 the shared variables.  It did not matter if the memory was
 local to the sender or receiver or remote from both!  (E.g.,
 colocal processes showed fast timings even if the shared
 memory were remote to both processes.)
 
 My results do not prove that all NUMA platforms behave in the
 same way.  My point is only that, though I understand the
 logic behind locality optimizations for FIFO placement, the
 only data I am aware of does not substantiate that logic.
 



The changes are:

 File: ompi/mca/mpool/sm/mpool_sm_module.c
 Function: mca_mpool_sm_alloc()
 
 Use the alignment requested by the caller rather than
 adding additional pagesize alignment as well.
 File: ompi/class/ompi_fifo.h
 Function: ompi_fifo_init()
  and ompi_fifo_write_to_head()
 
 Align ompi_cb_fifo_wrapper_t structure on cacheline rather than page.
 File: ompi/class/ompi_circular_buffer_fifo.h
 Function: ompi_cb_fifo_init()
 
 Align the two calls to mpool_alloc on cacheline rather than page.
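
A small self-contained sketch of what the alignment change buys; the
4 KiB page and 64-byte cacheline are assumptions, and align_up() is a
stand-in helper rather than the Open MPI routine:

  #include <stdio.h>
  #include <stddef.h>

  /* round sz up to the next multiple of align (a power of two) */
  static size_t align_up(size_t sz, size_t align)
  {
      return (sz + align - 1) & ~(align - 1);
  }

  int main(void)
  {
      const size_t page = 4096, cacheline = 64, head = 12;
      printf("12-byte head padded to a page:      %zu bytes\n",
             align_up(head, page));       /* prints 4096 */
      printf("12-byte head padded to a cacheline: %zu bytes\n",
             align_up(head, cacheline));  /* prints 64 */
      /* with n(n-1) FIFOs and three pieces each, the per-piece padding
         is multiplied by roughly 3*n^2 */
      return 0;
  }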


2.  Aggregated Allocation


Another option is to lay out all the CBs at once and aggregate
their allocations.


This may have the added benefit of reducing lock contention during
MPI_Init().  On the one hand, the 3n^2 CB
allocations during MPI_Init() contend for a single
mca_common_sm_mmap->map_seg->seg_lock lock.  On the other
hand, I know so far of no data 

Re: [OMPI devel] autosizing the shared memory backing file

2009-01-14 Thread Eugene Loh
Thanks for the reply.  I kind of understand, but it's rather weird.  The 
BTL calls mca_mpool_base_module_create() to create a pool of memory, but 
the BTL has no say how big of a pool to create?  Could you imagine 
having a memory allocation routine ("malloc" or something) that didn't 
allow you to control the size of the allocation?  Instead, the 
allocation routine determines the size.  That's weird.  I must be 
missing something about how this is supposed to work.


E.g., I see that there is a "resources" argument 
(mca_mpool_base_resources_t).  Maybe that structure should be expanded 
to include a "size" field?
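
A sketch of that suggestion, using a stand-in type name so as not to
misstate the real mpool header -- the field and its use here are
hypothetical:

  #include <stddef.h>

  /* stand-in for an extended mca_mpool_base_resources_t */
  typedef struct {
      /* ...whatever the sm mpool's resources struct already carries... */
      size_t size;   /* hypothetical: backing-file size requested by the BTL */
  } example_sm_mpool_resources_t;

  /* The sm BTL would fill in .size from the parameters it already knows
     (eager limit, max frag size, number of local procs) before calling
     mca_mpool_base_module_create(), and the sm mpool would use it when
     sizing the mmap file. */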


Or, maybe I should bypass 
mca_mpool_base_module_create()/mca_mpool_sm_init() and just call
mca_common_sm_mmap_init() directly, the way mca/coll/sm does things.  
That would allow me to specify the size of the file.


George Bosilca wrote:

The simple answer is you can't. The mpool is loaded before the BTLs  
and on Linux the loader uses the RTLD_NOW flag (i.e. all symbols have  
to be defined or the dlopen call will fail).


Moreover, there is no way in Open MPI to exchange information between  
components except a global variable or something in the mca/common. 
In  other words there is no way for you to call from the mpool a 
function  from the sm BTL.


On Jan 13, 2009, at 19:22 , Eugene Loh wrote:

With the sm BTL, there is a file that each process mmaps in for  
shared memory.


I'm trying to get mpool_sm to size the file appropriately.  So, I  
would like mpool_sm to call some mca_btl_sm function that provides a  
good guess of the size.  (mpool_sm creates and mmaps the file, but  
the size depends on parameters like eager limit and max frag size  
that are known by the btl_sm.)




Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread George Bosilca

Here we go by the book :)

https://svn.open-mpi.org/trac/ompi/ticket/1749

  george.

On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:

Let's debate tomorrow when people are around, but first you have to  
file a CMR... :-)


On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:

Unfortunately, this pinpoints the fact that we didn't test enough  
the collective module mixing thing. I went over the tuned  
collective functions and changed all instances to use the correct  
module information. It is now on the trunk, revision 20267.  
Simultaneously, I checked that all other collective components do  
the right thing ... and I have to admit tuned was the only faulty  
one.


This is clearly a bug in the tuned, and correcting it will allow  
people to use the hierarch. In the current incarnation 1.3 will  
mostly/always segfault when hierarch is active. I would prefer not  
to give a broken toy out there. How about pushing r20267 in the 1.3?


george.


On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:

Thanks for digging into this.  Can you file a bug?  Let's mark it  
for v1.3.1.


I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,  
and since hierarch isn't currently selected by default (you must  
specifically elevate hierarch's priority to get it to run),  
there's no danger that users will run into this problem in default  
runs.


But clearly the problem needs to be fixed, and therefore we need a  
bug to track it.




On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

I just debugged the Reduce_scatter bug mentioned previously. The  
bug is unfortunately not in hierarch, but in tuned.


Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
...
err = comm->c_coll.coll_reduce (..., module)
...
}


but should be
{
...
err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
...
}

The problem as it is right now is that, when using hierarch, only
a subset of the functions is set, e.g. reduce, allreduce, bcast
and barrier. Thus, reduce_scatter is from tuned in most
scenarios, and calls the subsequent functions with the wrong
module. Hierarch of course does not like that :-)


Anyway, a quick glance through the tuned code reveals a  
significant number of instances where this  
appears(reduce_scatter, allreduce, allgather, allgatherv). Basic,  
hierarch and inter seem to do that mostly correctly.


Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel