Re: [OMPI devel] Contributor License Agreement

2014-08-29 Thread Ralph Castain
Well, as I said, I'm not a lawyer and I'm not about to argue legal issues with 
you :-)

All I can do is reiterate that the lawyers involved felt (a) that this was 
required as a gateway to granting commit permissions to our repository, (b) 
that it was separate from the license itself, and (c) that everything was fully 
compatible. It has been reviewed and ratified by additional legal teams over the 
years as other contributing corporations and organizations have joined.

You are welcome to contact the legal departments of the contributing 
organizations (as shown on our web site) to request a legal explanation.


On Aug 29, 2014, at 10:02 AM, Jed Brown  wrote:

> Ralph Castain  writes:
> 
>> I'm not a lawyer, but that agreement was formulated by the lawyers of
>> several national labs, universities, and corporations back at the very
>> beginning of the project, and so that's what we have to use.
> 
> MPICH does the same thing, but the CLA grants permission to redistribute
> under the terms of the Apache-2.0 license, yet the product is being
> distributed under a license that is not compatible with Apache-2.0.
> 
> Even more strongly, lots of BSD-style permissive software is
> incorporated into GPLv2 and LGPLv2.1 distributions (this direction is
> well-tested).  Meanwhile, Apache-2.0 is famously incompatible with GPLv2
> and LGPLv2.1.
> 
>  "Please note that this license is not compatible with GPL version 2,
>  because it has some requirements that are not in that GPL
>  version. These include certain patent termination and indemnification
>  provisions." -- https://www.gnu.org/licenses/license-list.html#apache2
> 
> License compatibility is a transitive property, so you certainly can't
> distribute Apache-2.0 software under a BSD-style license (as shown in
> the compatibility chart I posted).
> 
>> According to the lawyers, it is indeed compatible. I'll let them argue
>> it :-)
> 
> I'd be interested in hearing the legal argument for why the apparent
> incompatibility is legally acceptable.  I'm not familiar with other
> projects using this particular combination.



Re: [OMPI devel] segfault in openib component on trunk

2014-08-29 Thread Ralph Castain
I think the problem is that the MCA params need to be set at startup, along 
with the flag indicating where they came from, but also need to be changeable 
via the MPI_T interface at a later point. So we are tripping over the issues of 
when to release and replace the param values, ensuring they are properly 
handled and don't cause the MCA param system to crash upon finalize, etc.
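
For reference, here is a minimal sketch of what touching such a parameter through
the MPI_T control-variable interface looks like (purely illustrative; whether a
given MCA param such as btl_openib_receive_queues is actually writable after
startup is exactly the open question here):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int provided, ncvar, i;

    /* MPI_T can be initialized before (and independently of) MPI_Init */
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    /* walk the control variables and look one up by name */
    MPI_T_cvar_get_num(&ncvar);
    for (i = 0; i < ncvar; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &binding, &scope);
        if (0 == strcmp(name, "btl_openib_receive_queues")) {
            printf("cvar %d: %s (scope %d)\n", i, name, scope);
            /* a runtime change would use MPI_T_cvar_handle_alloc() +
             * MPI_T_cvar_write(); it can only succeed if the variable's
             * scope allows changes after startup, which is the crux of
             * the problem described above */
        }
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}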


On Aug 29, 2014, at 8:22 AM, Shamis, Pavel  wrote:

> I was under the impression that mca_btl_openib_tune_endpoint was supposed to 
> handle the mismatch between the tunings of different devices.
> A few years ago we did some "extreme" interoperability testing and OMPI 
> handled all cases really well.
> 
> I'm not sure I correctly understand what the "core" issue is.
> 
> 
> Pavel (Pasha) Shamis
> ---
> Computer Science Research Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
> 
> 
> 
> 
> 
> 
> On Aug 29, 2014, at 4:12 AM, Gilles Gouaillardet wrote:
> 
> Ralph,
> 
> 
> r32639 and r32642 fix bugs that exist in both trunk and v1.8, and they can be 
> considered independent of the issue that is discussed in this thread and in 
> the one you pointed to.
> 
> So IMHO they should land in v1.8 even if they do not fix the issue we are now 
> discussing here.
> 
> Cheers,
> 
> Gilles
> 
> 
> On 2014/08/29 16:42, Ralph Castain wrote:
> 
> This is the email thread which sparked the problem:
> 
> http://www.open-mpi.org/community/lists/devel/2014/07/15329.php
> 
> I actually tried to apply the original CMR and couldn't get it to work in the 
> 1.8 branch - I just kept having problems, so I pushed it off to 1.8.3. I'm 
> leery of accepting either of the current CMRs for two reasons: (a) none of the 
> preceding changes is in the 1.8 series yet, and (b) it doesn't sound like we 
> have a complete solution yet.
> 
> Anyway, I just wanted to point to the original problem that was trying to be 
> addressed.
> 
> 
> On Aug 28, 2014, at 10:01 PM, Gilles Gouaillardet wrote:
> 
> 
> 
> Howard and Edgar,
> 
> I fixed a few bugs (r32639 and r32642).
> 
> The bug is trivial to reproduce with any MPI hello world program:
> 
> mpirun -np 2 --mca btl openib,self hello_world
> 
> after setting the MCA param in $HOME/.openmpi/mca-params.conf:
> 
> $ cat ~/.openmpi/mca-params.conf
> btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3
> 
> The good news is that the program does not crash with a glorious SIGSEGV any
> more; the bad news is that it will (nicely) abort for an incorrect reason:
> 
> --
> The Open MPI receive queue configuration for the OpenFabrics devices
> on two nodes are incompatible, meaning that MPI processes on two
> specific nodes were unable to communicate with each other.  This
> generally happens when you are using OpenFabrics devices from
> different vendors on the same network.  You should be able to use the
> mca_btl_openib_receive_queues MCA parameter to set a uniform receive
> queue configuration for all the devices in the MPI job, and therefore
> be able to run successfully.
> 
> Local host:   node0
> Local adapter:mlx4_0 (vendor 0x2c9, part ID 4099)
> Local queues: S,12288,128,64,32:S,65536,128,64,3
> 
> Remote host:  node0
> Remote adapter:   (vendor 0x2c9, part ID 4099)
> Remote queues:
> P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
> 
> The root cause is that the remote host did not send its receive_queues to the
> local host (and hence the local host believes the remote host uses the
> default value).
> 
> The logic was revamped relative to v1.8, which is why v1.8 does not have this
> issue.
> 
> I am still thinking about what the right fix should be:
> - one option is to send the receive queues
> - another option would be to differentiate a value overridden in
> mca-params.conf (which should always be OK) from a value overridden in the
> .ini file (where we might want to double-check that local and remote values
> match)
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/29 7:02, Pritchard Jr., Howard wrote:
> 
> 
> Hi Edgar,
> 
> Could you send me your conf file?  I'll try to reproduce it.
> 
> Maybe run with --mca btl_base_verbose 20 or something to
> see what the code that is parsing this field in the conf file
> is finding.
> 
> 
> Howard
> 
> 
> -Original Message-
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Edgar Gabriel
> Sent: Thursday, August 28, 2014 3:40 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] segfault in openib component on trunk
> 
> to add another piece of information that I just found, the segfault only 
> occurs if I have a particular mca parameter set in my mca-params.conf file, 
> namely
> 
> btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3
> 
> Has the syntax for this parameter changed, or should/can I get rid of it?

Re: [OMPI devel] mpirun hangs when a task exits with a non zero code

2014-08-29 Thread Ralph Castain
I dug into this a bit and think the patch wasn't quite complete, so I modified 
the approach to ensure this race condition gets resolved in every scenario. 
Hopefully, r32643 takes care of it for you.


On Aug 29, 2014, at 1:08 AM, Gilles Gouaillardet wrote:

> Ralph and all,
> 
> The following trivial test hangs
> (it hangs at least 99% of the time in my environment; the remaining 1% is a
> race condition in which the program behaves as expected):
> 
> mpirun -np 1 --mca btl self /bin/false
> 
> The same behaviour happens with the following trivial MPI program:
> 
> #include <mpi.h>
> 
> int main (int argc, char *argv[]) {
>     MPI_Init(&argc, &argv);
>     MPI_Finalize();
>     return 1;
> }
> 
> The attached patch fixes the hang (i.e. the program nicely aborts with
> the correct error message).
> 
> I did not commit it since I am not confident about it at all.
> 
> Could you please review it?
> 
> Cheers
> 
> Gilles



Re: [OMPI devel] Contributor License Agreement

2014-08-29 Thread Jed Brown
Ralph Castain  writes:

> I'm not a lawyer, but that agreement was formulated by the lawyers of
> several national labs, universities, and corporations back at the very
> beginning of the project, and so that's what we have to use.

MPICH does the same thing, but the CLA grants permission to redistribute
under the terms of the Apache-2.0 license, yet the product is being
distributed under a license that is not compatible with Apache-2.0.

Even more strongly, lots of BSD-style permissive software is
incorporated into GPLv2 and LGPLv2.1 distributions (this direction is
well-tested).  Meanwhile, Apache-2.0 is famously incompatible with GPLv2
and LGPLv2.1.

  "Please note that this license is not compatible with GPL version 2,
  because it has some requirements that are not in that GPL
  version. These include certain patent termination and indemnification
  provisions." -- https://www.gnu.org/licenses/license-list.html#apache2

License compatibility is a transitive property, so you certainly can't
distribute Apache-2.0 software under a BSD-style license (as shown in
the compatibility chart I posted).

> According to the lawyers, it is indeed compatible. I'll let them argue
> it :-)

I'd be interested in hearing the legal argument for why the apparent
incompatibility is legally acceptable.  I'm not familiar with other
projects using this particular combination.




Re: [OMPI devel] Contributor License Agreement

2014-08-29 Thread Ralph Castain
I'm not a lawyer, but that agreement was formulated by the lawyers of several 
national labs, universities, and corporations back at the very beginning of the 
project, and so that's what we have to use.

According to the lawyers, it is indeed compatible. I'll let them argue it :-)


On Aug 29, 2014, at 8:47 AM, Jed Brown  wrote:

> Ralph Castain  writes:
> 
>> Indeed, welcome!
>> 
>> Just to make things smoother: are you planning to contribute your work
>> back to the community? If so, we'll need a signed contributor
>> agreement - see here:
>> 
>> http://www.open-mpi.org/community/contribute/corporate.php
> 
> This is an Apache-2.0 CLA, which has a patent indemnification clause.
> Why is this used even though Open MPI is distributed under 3-clause BSD,
> an incompatible license?  (You can include BSD-3 software in an
> Apache-2.0 distribution, but not vice-versa due to the patent
> indemnification clause.)
> 
>  http://www.dwheeler.com/essays/floss-license-slide.html
> 
> 
> 
> FWIW, I think a DCO provides more accurate provenance and I agree with
> Bradley Kuhn's arguments that CLAs raise the barrier for entry and are
> unnecessary.
> 
>  http://ebb.org/bkuhn/blog/2011/07/07/harmony-harmful.html



[OMPI devel] Contributor License Agreement

2014-08-29 Thread Jed Brown
Ralph Castain  writes:

> Indeed, welcome!
>
> Just to make things smoother: are you planning to contribute your work
> back to the community? If so, we'll need a signed contributor
> agreement - see here:
>
> http://www.open-mpi.org/community/contribute/corporate.php

This is an Apache-2.0 CLA, which has a patent indemnification clause.
Why is this used even though Open MPI is distributed under 3-clause BSD,
an incompatible license?  (You can include BSD-3 software in an
Apache-2.0 distribution, but not vice-versa due to the patent
indemnification clause.)

  http://www.dwheeler.com/essays/floss-license-slide.html



FWIW, I think a DCO provides more accurate provenance and I agree with
Bradley Kuhn's arguments that CLAs raise the barrier for entry and are
unnecessary.

  http://ebb.org/bkuhn/blog/2011/07/07/harmony-harmful.html




Re: [OMPI devel] segfault in openib component on trunk

2014-08-29 Thread Shamis, Pavel
I was under the impression that mca_btl_openib_tune_endpoint was supposed to 
handle the mismatch between the tunings of different devices.
A few years ago we did some "extreme" interoperability testing and OMPI handled 
all cases really well.

I'm not sure I correctly understand what the "core" issue is.


Pavel (Pasha) Shamis
---
Computer Science Research Group
Computer Science and Math Division
Oak Ridge National Laboratory






On Aug 29, 2014, at 4:12 AM, Gilles Gouaillardet wrote:

Ralph,


r32639 and r32642 fix bugs that exist in both trunk and v1.8, and they can be 
considered independent of the issue that is discussed in this thread and in the 
one you pointed to.

So IMHO they should land in v1.8 even if they do not fix the issue we are now 
discussing here.

Cheers,

Gilles


On 2014/08/29 16:42, Ralph Castain wrote:

This is the email thread which sparked the problem:

http://www.open-mpi.org/community/lists/devel/2014/07/15329.php

I actually tried to apply the original CMR and couldn't get it to work in the 
1.8 branch - I just kept having problems, so I pushed it off to 1.8.3. I'm leery 
of accepting either of the current CMRs for two reasons: (a) none of the 
preceding changes is in the 1.8 series yet, and (b) it doesn't sound like we 
have a complete solution yet.

Anyway, I just wanted to point to the original problem that was trying to be 
addressed.


On Aug 28, 2014, at 10:01 PM, Gilles Gouaillardet wrote:



Howard and Edgar,

I fixed a few bugs (r32639 and r32642).

The bug is trivial to reproduce with any MPI hello world program:

mpirun -np 2 --mca btl openib,self hello_world

after setting the MCA param in $HOME/.openmpi/mca-params.conf:

$ cat ~/.openmpi/mca-params.conf
btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3

The good news is that the program does not crash with a glorious SIGSEGV any
more; the bad news is that it will (nicely) abort for an incorrect reason:

--
The Open MPI receive queue configuration for the OpenFabrics devices
on two nodes are incompatible, meaning that MPI processes on two
specific nodes were unable to communicate with each other.  This
generally happens when you are using OpenFabrics devices from
different vendors on the same network.  You should be able to use the
mca_btl_openib_receive_queues MCA parameter to set a uniform receive
queue configuration for all the devices in the MPI job, and therefore
be able to run successfully.

 Local host:   node0
 Local adapter:mlx4_0 (vendor 0x2c9, part ID 4099)
 Local queues: S,12288,128,64,32:S,65536,128,64,3

 Remote host:  node0
 Remote adapter:   (vendor 0x2c9, part ID 4099)
 Remote queues:
P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64

The root cause is that the remote host did not send its receive_queues to the
local host (and hence the local host believes the remote host uses the
default value).

The logic was revamped relative to v1.8, which is why v1.8 does not have this
issue.

I am still thinking about what the right fix should be:
- one option is to send the receive queues
- another option would be to differentiate a value overridden in
mca-params.conf (which should always be OK) from a value overridden in the
.ini file (where we might want to double-check that local and remote values
match)

Cheers,

Gilles

On 2014/08/29 7:02, Pritchard Jr., Howard wrote:


Hi Edgar,

Could you send me your conf file?  I'll try to reproduce it.

Maybe run with --mca btl_base_verbose 20 or something to
see what the code that is parsing this field in the conf file
is finding.


Howard


-Original Message-
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Edgar Gabriel
Sent: Thursday, August 28, 2014 3:40 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] segfault in openib component on trunk

to add another piece of information that I just found, the segfault only occurs 
if I have a particular mca parameter set in my mca-params.conf file, namely

btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3

Has the syntax for this parameter changed, or should/can I get rid of it?

Thanks
Edgar

On 08/28/2014 04:19 PM, Edgar Gabriel wrote:


we have recently been having problems running trunk with the openib component
enabled on one of our clusters. The problem occurs right in the
initialization part; here is the stack right before the segfault:

---snip---
(gdb) where
#0  mca_btl_openib_tune_endpoint (openib_btl=0x762a40,
endpoint=0x7d9660) at btl_openib.c:470
#1  0x7f1062f105c4 in mca_btl_openib_add_procs (btl=0x762a40,
nprocs=2, procs=0x759be0, peers=0x762440, reachable=0x7fff22dd16f0) at
btl_openib.c:1093
#2  0x7f106316102c in mca_bml_r2_add_procs (nprocs=2,
procs=0x759be0, reachable=0x7fff22dd16f0) at bml_r2.c:201
#3  0x7f10615c0dd5 in mca_pml_ob1_add_procs (procs=0x70dc00,
nprocs=2) at pml_ob1.c:334

Re: [OMPI devel] Fwd: recomended software stack for development?

2014-08-29 Thread Ralph Castain
Indeed, welcome!

Just to make things smoother: are you planning to contribute your work back to 
the community? If so, we'll need a signed contributor agreement - see here:

http://www.open-mpi.org/community/contribute/corporate.php


On Aug 29, 2014, at 7:40 AM, Jeff Squyres (jsquyres)  wrote:

> On Aug 29, 2014, at 5:36 AM, Manuel Rodríguez Pascual wrote:
> 
>> We are a small development team that will soon start working in open-mpi. 
> 
> Welcome!
> 
>> Being total newbies in the area (both to Open MPI and to this kind of large 
>> project), we are seeking advice on which tools to use for development. Any 
>> suggestion on IDE, compiler, regression-testing software and everything else 
>> is more than welcome. Of course this is highly personal, but it would be 
>> great to know what you folks are using, to help us decide and start working.
> 
> I think you'll find us all over the map on IDE.  I personally use 
> emacs+terminal.  I know others who use vim+terminal.  Many of us use ctags 
> and the like, but it's not quite as helpful as usual because of OMPI's heavy 
> use of pointers.  I don't think many developers use a full-blown IDE.
> 
> For compiler, I'm guessing most of us develop with gcc most of the time, 
> although a few may have non-gcc as the default.  We test across a wide 
> variety of compilers, so portability is important.
> 
> For regression testing, we use the MPI Testing Tool 
> (https://svn.open-mpi.org/trac/mtt/ and http://mtt.open-mpi.org/).  Many of 
> us have it configured to do builds of the nightly tarballs; some of us push 
> our results to the public database at mtt.open-mpi.org.
> 
>> Thanks for your help. We are really looking to cooperate with the project, 
>> so we'll hopefully see you around here for a while!
> 
> Just curious: what do you anticipate working on?
> 
> It might be a good idea to see our "intro to the OMPI code base" videos: 
> http://www.open-mpi.org/video/?category=internals
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 



Re: [OMPI devel] about the test_shmem_zero_get.x test from the openshmem test suite

2014-08-29 Thread Jeff Squyres (jsquyres)
Gilles --

Did you get a reply about this?


On Aug 26, 2014, at 3:17 AM, Gilles Gouaillardet wrote:

> Folks,
> 
> The test_shmem_zero_get.x test from the openshmem-release-1.0d test suite is
> currently failing.
> 
> I looked at the test itself and compared it to test_shmem_zero_put.x
> (which succeeds), and I am very puzzled ...
> 
> The test calls several flavors of shmem_*_get where:
> - the destination is in the shmem (why not, but this is useless)
> - the source is *not* in the shmem
> - the number of elements to be transferred is zero
> 
> Currently, this is failing because the source is *not* in the shmem.
> 
> 1) Is the test itself correct?
> If we compare it to test_shmem_zero_put.x, I would guess that the
> destination should be in local memory and the source should be in the shmem.
> 
> 2) Should shmem_*_get even fail?
> There is zero data to be transferred, so why do we even care
> whether the source is in the shmem or not?
> Is the OpenSHMEM standard explicit about this case (i.e. zero elements
> to be transferred)?
> 
> 3) Is a failure expected?
> Even if I doubt it, this is an option ... and in this case, MTT should
> be aware of it and report a success when the test fails.
> 
> 4) The test is a success on v1.8.
> The reason is that the default configure value is --oshmem-param-check=never
> on v1.8 whereas it is --oshmem-param-check=always on trunk.
> Is there any reason for this?
> 
> Cheers,
> 
> Gilles
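
For context, a minimal sketch of the zero-element get pattern described above
(illustrative only; this is not the actual test source, and it uses the
OpenSHMEM 1.0-era start_pes() entry point):

#include <shmem.h>

long dest_sym;          /* symmetric (global) - a valid shmem destination */

int main(void)
{
    long src_local = 0; /* NOT symmetric - this is what the parameter check flags */

    start_pes(0);
    /* zero elements requested: should a non-symmetric source still be an error? */
    shmem_getmem(&dest_sym, &src_local, 0, 0);
    return 0;
}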


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] Fwd: recomended software stack for development?

2014-08-29 Thread Jeff Squyres (jsquyres)
On Aug 29, 2014, at 5:36 AM, Manuel Rodríguez Pascual wrote:

> We are a small development team that will soon start working in open-mpi. 

Welcome!

> Being total newbies in the area (both to Open MPI and to this kind of large 
> project), we are seeking advice on which tools to use for development. Any 
> suggestion on IDE, compiler, regression-testing software and everything else 
> is more than welcome. Of course this is highly personal, but it would be 
> great to know what you folks are using, to help us decide and start working.

I think you'll find us all over the map on IDE.  I personally use 
emacs+terminal.  I know others who use vim+terminal.  Many of us use ctags and 
the like, but it's not quite as helpful as usual because of OMPI's heavy use of 
pointers.  I don't think many developers use a full-blown IDE.

For compiler, I'm guessing most of us develop with gcc most of the time, 
although a few may have non-gcc as the default.  We test across a wide variety 
of compilers, so portability is important.

For regression testing, we use the MPI Testing Tool 
(https://svn.open-mpi.org/trac/mtt/ and http://mtt.open-mpi.org/).  Many of us 
have it configured to do builds of the nightly tarballs; some of us push our 
results to the public database at mtt.open-mpi.org.

> Thanks for your help. We are really looking to cooperate with the project, so 
> we'll hopefully see you around here for a while!

Just curious: what do you anticipate working on?

It might be a good idea to see our "intro to the OMPI code base" videos: 
http://www.open-mpi.org/video/?category=internals

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] Fwd: recomended software stack for development?

2014-08-29 Thread Manuel Rodríguez Pascual
Good morning all,

We are a small development team that will soon start working in open-mpi.

Being total newbies in the area (both to Open MPI and to this kind of large
project), we are seeking advice on which tools to use for development. Any
suggestion on IDE, compiler, regression-testing software and everything else
is more than welcome. Of course this is highly personal, but it would be
great to know what you folks are using, to help us decide and start working.

Thanks for your help. We are really looking to cooperate with the project,
so we'll hopefully see you around here for a while!

Thanks for your help,

Manuel



-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN


Re: [OMPI devel] segfault in openib component on trunk

2014-08-29 Thread Gilles Gouaillardet
Ralph,

r32639 and r32642 fix bugs that exist in both trunk and v1.8, and they can be 
considered independent of the issue that is discussed in this thread and in the 
one you pointed to.

So IMHO they should land in v1.8 even if they do not fix the issue we are now 
discussing here.

Cheers,

Gilles


On 2014/08/29 16:42, Ralph Castain wrote:
> This is the email thread which sparked the problem:
>
> http://www.open-mpi.org/community/lists/devel/2014/07/15329.php
>
> I actually tried to apply the original CMR and couldn't get it to work in the 
> 1.8 branch - I just kept having problems, so I pushed it off to 1.8.3. I'm 
> leery of accepting either of the current CMRs for two reasons: (a) none of the 
> preceding changes is in the 1.8 series yet, and (b) it doesn't sound like we 
> have a complete solution yet.
>
> Anyway, I just wanted to point to the original problem that was trying to be 
> addressed.
>
>
> On Aug 28, 2014, at 10:01 PM, Gilles Gouaillardet wrote:
>
>> Howard and Edgar,
>>
>> i fixed a few bugs (r32639 and r32642)
>>
>> the bug is trivial to reproduce with any mpi hello world program
>>
>> mpirun -np 2 --mca btl openib,self hello_world
>>
>> after setting the mca param in the $HOME/.openmpi/mca-params.conf
>>
>> $ cat ~/.openmpi/mca-params.conf
>> btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3
>>
>> good news is the program does not crash with a glory SIGSEGV any more
>> bad news is the program will (nicely) abort for an incorrect reason :
>>
>> --
>> The Open MPI receive queue configuration for the OpenFabrics devices
>> on two nodes are incompatible, meaning that MPI processes on two
>> specific nodes were unable to communicate with each other.  This
>> generally happens when you are using OpenFabrics devices from
>> different vendors on the same network.  You should be able to use the
>> mca_btl_openib_receive_queues MCA parameter to set a uniform receive
>> queue configuration for all the devices in the MPI job, and therefore
>> be able to run successfully.
>>
>>  Local host:   node0
>>  Local adapter:mlx4_0 (vendor 0x2c9, part ID 4099)
>>  Local queues: S,12288,128,64,32:S,65536,128,64,3
>>
>>  Remote host:  node0
>>  Remote adapter:   (vendor 0x2c9, part ID 4099)
>>  Remote queues:   
>> P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
>>
>> the root cause is the remote host did not send its receive_queues to the
>> local host
>> (and hence the local host believes the remote hosts uses the default value)
>>
>> the logic was revamped vs v1.8, that is why v1.8 does not have such issue.
>>
>> i am still thinking what should be the right fix :
>> - one option is to send the receive queues
>> - an other option would be to differenciate value overrided in
>> mca-params.conf (should be always ok) of value overrided in the .ini
>>  (might want to double check local and remote values match)
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/08/29 7:02, Pritchard Jr., Howard wrote:
>>> Hi Edgar,
>>>
>>> Could you send me your conf file?  I'll try to reproduce it.
>>>
>>> Maybe run with --mca btl_base_verbose 20 or something to
>>> see what the code that is parsing this field in the conf file
>>> is finding.
>>>
>>>
>>> Howard
>>>
>>>
>>> -Original Message-
>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Edgar Gabriel
>>> Sent: Thursday, August 28, 2014 3:40 PM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] segfault in openib component on trunk
>>>
>>> to add another piece of information that I just found, the segfault only 
>>> occurs if I have a particular mca parameter set in my mca-params.conf file, 
>>> namely
>>>
>>> btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3
>>>
>>> Has the syntax for this parameter changed, or should/can I get rid of it?
>>>
>>> Thanks
>>> Edgar
>>>
>>> On 08/28/2014 04:19 PM, Edgar Gabriel wrote:
 we have recently been having problems running trunk with the openib component
 enabled on one of our clusters. The problem occurs right in the
 initialization part; here is the stack right before the segfault:

 ---snip---
 (gdb) where
 #0  mca_btl_openib_tune_endpoint (openib_btl=0x762a40,
 endpoint=0x7d9660) at btl_openib.c:470
 #1  0x7f1062f105c4 in mca_btl_openib_add_procs (btl=0x762a40, 
 nprocs=2, procs=0x759be0, peers=0x762440, reachable=0x7fff22dd16f0) at
 btl_openib.c:1093
 #2  0x7f106316102c in mca_bml_r2_add_procs (nprocs=2, 
 procs=0x759be0, reachable=0x7fff22dd16f0) at bml_r2.c:201
 #3  0x7f10615c0dd5 in mca_pml_ob1_add_procs (procs=0x70dc00,
 nprocs=2) at pml_ob1.c:334
 #4  0x7f106823ed84 in ompi_mpi_init (argc=1, argv=0x7fff22dd1da8, 
 requested=0, provided=0x7fff22dd184c) at runtime/ompi_mpi_init.c:790
 #5  0x7f1068273a2c in MPI_Init (argc=0x7fff22dd188c,
 argv=0x7fff22dd1880) at init.c:84

[OMPI devel] mpirun hangs when a task exits with a non zero code

2014-08-29 Thread Gilles Gouaillardet
Ralph and all,

The following trivial test hangs
(it hangs at least 99% of the time in my environment; the remaining 1% is a
race condition in which the program behaves as expected):

mpirun -np 1 --mca btl self /bin/false

The same behaviour happens with the following trivial MPI program:

#include <mpi.h>

int main (int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 1;
}

The attached patch fixes the hang (i.e. the program nicely aborts with
the correct error message).

I did not commit it since I am not confident about it at all.

Could you please review it?

Cheers

Gilles
Index: orte/mca/errmgr/default_hnp/errmgr_default_hnp.c
===
--- orte/mca/errmgr/default_hnp/errmgr_default_hnp.c    (revision 32642)
+++ orte/mca/errmgr/default_hnp/errmgr_default_hnp.c    (working copy)
@@ -10,6 +10,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC.
  * All rights reserved.
  * Copyright (c) 2014  Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014  Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -382,6 +384,14 @@
 jdata->num_terminated++;
 }

+    /* FIXME ???
+     * mark the proc as no more alive if needed
+     */
+    if (ORTE_PROC_STATE_KILLED_BY_CMD == state) {
+        if (ORTE_FLAG_TEST(pptr, ORTE_PROC_FLAG_WAITPID) &&
+            ORTE_FLAG_TEST(pptr, ORTE_PROC_FLAG_IOF_COMPLETE)) {
+            ORTE_FLAG_UNSET(pptr, ORTE_PROC_FLAG_ALIVE);
+        }
+    }
 /* if we were ordered to terminate, mark this proc as dead and see if
  * any of our routes or local  children remain alive - if not, then
  * terminate ourselves. */


Re: [OMPI devel] segfault in openib component on trunk

2014-08-29 Thread Ralph Castain
This is the email thread which sparked the problem:

http://www.open-mpi.org/community/lists/devel/2014/07/15329.php

I actually tried to apply the original CMR and couldn't get it to work in the 
1.8 branch - I just kept having problems, so I pushed it off to 1.8.3. I'm leery 
of accepting either of the current CMRs for two reasons: (a) none of the 
preceding changes is in the 1.8 series yet, and (b) it doesn't sound like we 
have a complete solution yet.

Anyway, I just wanted to point to the original problem that was trying to be 
addressed.


On Aug 28, 2014, at 10:01 PM, Gilles Gouaillardet wrote:

> Howard and Edgar,
> 
> I fixed a few bugs (r32639 and r32642).
> 
> The bug is trivial to reproduce with any MPI hello world program:
> 
> mpirun -np 2 --mca btl openib,self hello_world
> 
> after setting the MCA param in $HOME/.openmpi/mca-params.conf:
> 
> $ cat ~/.openmpi/mca-params.conf
> btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3
> 
> The good news is that the program does not crash with a glorious SIGSEGV any
> more; the bad news is that it will (nicely) abort for an incorrect reason:
> 
> --
> The Open MPI receive queue configuration for the OpenFabrics devices
> on two nodes are incompatible, meaning that MPI processes on two
> specific nodes were unable to communicate with each other.  This
> generally happens when you are using OpenFabrics devices from
> different vendors on the same network.  You should be able to use the
> mca_btl_openib_receive_queues MCA parameter to set a uniform receive
> queue configuration for all the devices in the MPI job, and therefore
> be able to run successfully.
> 
>  Local host:   node0
>  Local adapter:mlx4_0 (vendor 0x2c9, part ID 4099)
>  Local queues: S,12288,128,64,32:S,65536,128,64,3
> 
>  Remote host:  node0
>  Remote adapter:   (vendor 0x2c9, part ID 4099)
>  Remote queues:   
> P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
> 
> The root cause is that the remote host did not send its receive_queues to the
> local host (and hence the local host believes the remote host uses the
> default value).
> 
> The logic was revamped relative to v1.8, which is why v1.8 does not have this
> issue.
> 
> I am still thinking about what the right fix should be:
> - one option is to send the receive queues
> - another option would be to differentiate a value overridden in
> mca-params.conf (which should always be OK) from a value overridden in the
> .ini file (where we might want to double-check that local and remote values
> match)
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/29 7:02, Pritchard Jr., Howard wrote:
>> Hi Edgar,
>> 
>> Could you send me your conf file?  I'll try to reproduce it.
>> 
>> Maybe run with --mca btl_base_verbose 20 or something to
>> see what the code that is parsing this field in the conf file
>> is finding.
>> 
>> 
>> Howard
>> 
>> 
>> -Original Message-
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Edgar Gabriel
>> Sent: Thursday, August 28, 2014 3:40 PM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] segfault in openib component on trunk
>> 
>> to add another piece of information that I just found, the segfault only 
>> occurs if I have a particular mca parameter set in my mca-params.conf file, 
>> namely
>> 
>> btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3
>> 
>> Has the syntax for this parameter changed, or should/can I get rid of it?
>> 
>> Thanks
>> Edgar
>> 
>> On 08/28/2014 04:19 PM, Edgar Gabriel wrote:
>>> we have recently been having problems running trunk with the openib component
>>> enabled on one of our clusters. The problem occurs right in the
>>> initialization part; here is the stack right before the segfault:
>>> 
>>> ---snip---
>>> (gdb) where
>>> #0  mca_btl_openib_tune_endpoint (openib_btl=0x762a40,
>>> endpoint=0x7d9660) at btl_openib.c:470
>>> #1  0x7f1062f105c4 in mca_btl_openib_add_procs (btl=0x762a40, 
>>> nprocs=2, procs=0x759be0, peers=0x762440, reachable=0x7fff22dd16f0) at
>>> btl_openib.c:1093
>>> #2  0x7f106316102c in mca_bml_r2_add_procs (nprocs=2, 
>>> procs=0x759be0, reachable=0x7fff22dd16f0) at bml_r2.c:201
>>> #3  0x7f10615c0dd5 in mca_pml_ob1_add_procs (procs=0x70dc00,
>>> nprocs=2) at pml_ob1.c:334
>>> #4  0x7f106823ed84 in ompi_mpi_init (argc=1, argv=0x7fff22dd1da8, 
>>> requested=0, provided=0x7fff22dd184c) at runtime/ompi_mpi_init.c:790
>>> #5  0x7f1068273a2c in MPI_Init (argc=0x7fff22dd188c,
>>> argv=0x7fff22dd1880) at init.c:84
>>> #6  0x004008e7 in main (argc=1, argv=0x7fff22dd1da8) at
>>> hello_world.c:13
>>> ---snip---
>>> 
>>> 
>>> In line 538 of the file containing the mca_btl_openib_tune_endpoint
>>> routine, the strcmp operation fails because recv_qps is a NULL pointer.
>>> 
>>> 
>>> ---snip---
>>> 
>>> if(0 != strcmp(mca_btl_openib_component.receive_queues, recv_qps)) {
>>> 
>>> ---snip---
>>> 
>> Does anybody have an idea on what might be going wrong and how to resolve it?

Re: [OMPI devel] segfault in openib component on trunk

2014-08-29 Thread Gilles Gouaillardet
Howard and Edgar,

I fixed a few bugs (r32639 and r32642).

The bug is trivial to reproduce with any MPI hello world program:

mpirun -np 2 --mca btl openib,self hello_world

after setting the MCA param in $HOME/.openmpi/mca-params.conf:

$ cat ~/.openmpi/mca-params.conf
btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3

The good news is that the program does not crash with a glorious SIGSEGV any
more; the bad news is that it will (nicely) abort for an incorrect reason:

--
The Open MPI receive queue configuration for the OpenFabrics devices
on two nodes are incompatible, meaning that MPI processes on two
specific nodes were unable to communicate with each other.  This
generally happens when you are using OpenFabrics devices from
different vendors on the same network.  You should be able to use the
mca_btl_openib_receive_queues MCA parameter to set a uniform receive
queue configuration for all the devices in the MPI job, and therefore
be able to run successfully.

  Local host:   node0
  Local adapter:mlx4_0 (vendor 0x2c9, part ID 4099)
  Local queues: S,12288,128,64,32:S,65536,128,64,3

  Remote host:  node0
  Remote adapter:   (vendor 0x2c9, part ID 4099)
  Remote queues:   
P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64

The root cause is that the remote host did not send its receive_queues to the
local host (and hence the local host believes the remote host uses the
default value).

The logic was revamped relative to v1.8, which is why v1.8 does not have this
issue.

I am still thinking about what the right fix should be:
- one option is to send the receive queues
- another option would be to differentiate a value overridden in
mca-params.conf (which should always be OK) from a value overridden in the
.ini file (where we might want to double-check that local and remote values
match)
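
As a rough illustration only (a standalone toy, not the actual openib BTL
internals, with hypothetical stand-in names), the defensive fallback being
discussed around the strcmp quoted further down could look like this:

#include <stdio.h>
#include <string.h>

/* hypothetical stand-ins for the real openib component state */
static const char *local_receive_queues = "S,12288,128,64,32:S,65536,128,64,3";
static const char *default_receive_queues =
    "P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64";

/* return 0 if the (possibly missing) remote string matches the local one */
static int queues_match(const char *remote_qps)
{
    /* treat a missing remote receive_queues string as "the peer uses the
     * default" instead of handing strcmp() a NULL pointer */
    const char *remote = (NULL != remote_qps) ? remote_qps : default_receive_queues;
    return strcmp(local_receive_queues, remote);
}

int main(void)
{
    printf("NULL remote -> %s\n", queues_match(NULL) ? "mismatch" : "match");
    printf("same string -> %s\n",
           queues_match("S,12288,128,64,32:S,65536,128,64,3") ? "mismatch" : "match");
    return 0;
}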

Cheers,

Gilles

On 2014/08/29 7:02, Pritchard Jr., Howard wrote:
> Hi Edgar,
>
> Could you send me your conf file?  I'll try to reproduce it.
>
> Maybe run with --mca btl_base_verbose 20 or something to
> see what the code that is parsing this field in the conf file
> is finding.
>
>
> Howard
>
>
> -Original Message-
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Edgar Gabriel
> Sent: Thursday, August 28, 2014 3:40 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] segfault in openib component on trunk
>
> to add another piece of information that I just found, the segfault only 
> occurs if I have a particular mca parameter set in my mca-params.conf file, 
> namely
>
> btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3
>
> Has the syntax for this parameter changed, or should/can I get rid of it?
>
> Thanks
> Edgar
>
> On 08/28/2014 04:19 PM, Edgar Gabriel wrote:
>> we have recently been having problems running trunk with the openib component
>> enabled on one of our clusters. The problem occurs right in the
>> initialization part; here is the stack right before the segfault:
>>
>> ---snip---
>> (gdb) where
>> #0  mca_btl_openib_tune_endpoint (openib_btl=0x762a40,
>> endpoint=0x7d9660) at btl_openib.c:470
>> #1  0x7f1062f105c4 in mca_btl_openib_add_procs (btl=0x762a40, 
>> nprocs=2, procs=0x759be0, peers=0x762440, reachable=0x7fff22dd16f0) at
>> btl_openib.c:1093
>> #2  0x7f106316102c in mca_bml_r2_add_procs (nprocs=2, 
>> procs=0x759be0, reachable=0x7fff22dd16f0) at bml_r2.c:201
>> #3  0x7f10615c0dd5 in mca_pml_ob1_add_procs (procs=0x70dc00,
>> nprocs=2) at pml_ob1.c:334
>> #4  0x7f106823ed84 in ompi_mpi_init (argc=1, argv=0x7fff22dd1da8, 
>> requested=0, provided=0x7fff22dd184c) at runtime/ompi_mpi_init.c:790
>> #5  0x7f1068273a2c in MPI_Init (argc=0x7fff22dd188c,
>> argv=0x7fff22dd1880) at init.c:84
>> #6  0x004008e7 in main (argc=1, argv=0x7fff22dd1da8) at
>> hello_world.c:13
>> ---snip---
>>
>>
>> In line 538 of the file containing the mca_btl_openib_tune_endpoint
>> routine, the strcmp operation fails because recv_qps is a NULL pointer.
>>
>>
>> ---snip---
>>
>> if(0 != strcmp(mca_btl_openib_component.receive_queues, recv_qps)) {
>>
>> ---snip---
>>
>> Does anybody have an idea on what might be going wrong and how to 
>> resolve it? Just to confirm, everything works perfectly with the 1.8 
>> series on that very same cluster.
>>
>> Thanks
>> Edgar