date:20141212

[OMPI devel] [1.8.4rc3+patches] Solaris status summary

2014-12-12 Thread Paul Hargrove

It appears that with Ralph's oob_tcp patches (paul.diff) everything is now
OK on Solaris-11/x86-64.

On Solaris-10/SPARC I needed to fix guess_strlen() (or change "%u" to "%d"
to avoid the issue) or else I didn't get very far at all (SEGV in orterun).
However, with that issue resolved things are still not "golden".

I have applied the oob_tcp patches and rebuilt on the Solaris-10/SPARC
system.
I had hoped it would fix an interrupted select warning I'd seen.
However, it is still there along with the loopback-if warning and one about
a failed accept().
Output is below.

-Paul

$ mpirun -mca btl sm,self -np 2 examples/ring_c
--
WARNING: No loopback interface was found. This can cause problems
when we spawn processes as they are likely to be unable to connect
back to their host daemon. Sadly, it may take awhile for the connect
attempt to fail, so you may experience a significant hang time.

You may wish to ctrl-c out of your job and activate loopback support
on at least one interface before trying again.

--
select: Interrupted system call
--
WARNING: No loopback interface was found. This can cause problems
when we spawn processes as they are likely to be unable to connect
back to their host daemon. Sadly, it may take awhile for the connect
attempt to fail, so you may experience a significant hang time.

You may wish to ctrl-c out of your job and activate loopback support
on at least one interface before trying again.

--
--
WARNING: No loopback interface was found. This can cause problems
when we spawn processes as they are likely to be unable to connect
back to their host daemon. Sadly, it may take awhile for the connect
attempt to fail, so you may experience a significant hang time.

You may wish to ctrl-c out of your job and activate loopback support
on at least one interface before trying again.

--
[xxx.xxx.xxx.xxx:09934] mca_oob_tcp_accept: accept() failed: Resource
temporarily unavailable (11).
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
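
The "select: Interrupted system call" line in the output above is the error text for EINTR, which select() reports when a signal arrives while it is blocked. A minimal sketch of the usual POSIX retry idiom follows. It is not taken from the Open MPI oob_tcp code; `wait_readable` is a hypothetical helper name, and the sketch restarts the full timeout on every retry.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical wrapper: wait until fd is readable, retrying when a signal
 * interrupts select(). Returns select()'s result:
 * 1 = ready, 0 = timed out, -1 = error other than EINTR. */
static int wait_readable(int fd, int timeout_sec)
{
    for (;;) {
        fd_set rfds;
        struct timeval tv;
        int rc;

        /* select() may modify both arguments, so rebuild them each time. */
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        rc = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (rc < 0 && EINTR == errno) {
            continue;   /* interrupted by a signal: retry, do not warn */
        }
        return rc;
    }
}
```

A caller that needs a strict overall deadline would have to track the elapsed time itself and shorten the timeout on each retry.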

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-509-g38d6627

2014-12-12 Thread Ralph Castain

Nathan - does this need to come to 1.8.4? Or do you want to go with Paul’s 
suggested fix?

> On Dec 12, 2014, at 8:09 AM, git...@crest.iu.edu wrote:
> 
> This is an automated email from the git hooks/post-receive script. It was
> generated because a ref change was pushed to the repository containing
> the project "open-mpi/ompi".
> 
> The branch, master has been updated
>   via  38d66272c51fd531181d9dc282a7260f40270f64 (commit)
>  from  f4aecdbfd22a74feadab5566d2d595b65be4a8cb (commit)
> 
> Those revisions listed above that are new to this repository have
> not appeared on any other notification email; so we list those
> revisions in full, below.
> 
> - Log -
> https://github.com/open-mpi/ompi/commit/38d66272c51fd531181d9dc282a7260f40270f64
> 
> commit 38d66272c51fd531181d9dc282a7260f40270f64
> Author: Nathan Hjelm 
> Date:   Fri Dec 12 09:09:01 2014 -0700
> 
>btl/vader: fix compile on SGI UV
> 
> diff --git a/opal/mca/btl/vader/btl_vader_component.c 
> b/opal/mca/btl/vader/btl_vader_component.c
> index 7061612..aabf03d 100644
> --- a/opal/mca/btl/vader/btl_vader_component.c
> +++ b/opal/mca/btl/vader/btl_vader_component.c
> @@ -354,9 +354,8 @@ static void mca_btl_vader_check_single_copy (void)
> #if OPAL_BTL_VADER_HAVE_XPMEM
> if (MCA_BTL_VADER_XPMEM == mca_btl_vader_component.single_copy_mechanism) 
> {
> /* try to create an xpmem segment for the entire address space */
> -mca_btl_vader_component.my_seg_id = xpmem_make (0, 
> VADER_MAX_ADDRESS, XPMEM_PERMIT_MODE, (void *)0666);
> -
> -if (-1 == mca_btl_vader_component.my_seg_id) {
> +rc = mca_btl_vader_xpmem_init ();
> +if (OPAL_SUCCESS != rc) {
> if (MCA_BTL_VADER_XPMEM == initial_mechanism) {
> opal_show_help("help-btl-vader.txt", "xpmem-make-failed",
>true, opal_process_info.nodename, errno,
> @@ -364,11 +363,7 @@ static void mca_btl_vader_check_single_copy (void)
> }
> 
> mca_btl_vader_select_next_single_copy_mechanism ();
> -} else {
> -mca_btl_vader.super.btl_get = mca_btl_vader_get_xpmem;
> -mca_btl_vader.super.btl_put = mca_btl_vader_get_xpmem;
> }
> -
> }
> #endif
> 
> diff --git a/opal/mca/btl/vader/btl_vader_xpmem.c 
> b/opal/mca/btl/vader/btl_vader_xpmem.c
> index 7e362ea..4bb9a3b 100644
> --- a/opal/mca/btl/vader/btl_vader_xpmem.c
> +++ b/opal/mca/btl/vader/btl_vader_xpmem.c
> @@ -19,6 +19,19 @@
> 
> #if OPAL_BTL_VADER_HAVE_XPMEM
> 
> +int mca_btl_vader_xpmem_init (void)
> +{
> +mca_btl_vader_component.my_seg_id = xpmem_make (0, VADER_MAX_ADDRESS, 
> XPMEM_PERMIT_MODE, (void *)0666);
> +if (-1 == mca_btl_vader_component.my_seg_id) {
> +return OPAL_ERR_NOT_AVAILABLE;
> +}
> +
> +mca_btl_vader.super.btl_get = mca_btl_vader_get_xpmem;
> +mca_btl_vader.super.btl_put = mca_btl_vader_get_xpmem;
> +
> +return OPAL_SUCCESS;
> +}
> +
> /* look up the remote pointer in the peer rcache and attach if
>  * necessary */
> mca_mpool_base_registration_t *vader_get_registation (struct 
> mca_btl_base_endpoint_t *ep, void *rem_ptr,
> diff --git a/opal/mca/btl/vader/btl_vader_xpmem.h 
> b/opal/mca/btl/vader/btl_vader_xpmem.h
> index 1be188a..e040e26 100644
> --- a/opal/mca/btl/vader/btl_vader_xpmem.h
> +++ b/opal/mca/btl/vader/btl_vader_xpmem.h
> @@ -22,6 +22,7 @@
>   #include 
> 
>   typedef int64_t xpmem_segid_t;
> +  typedef int64_t xpmem_apid_t;
> #endif
> 
> /* look up the remote pointer in the peer rcache and attach if
> @@ -30,6 +31,8 @@
> /* largest address we can attach to using xpmem */
> #define VADER_MAX_ADDRESS ((uintptr_t)0x7000ul)
> 
> +int mca_btl_vader_xpmem_init (void);
> +
> mca_mpool_base_registration_t *vader_get_registation (struct 
> mca_btl_base_endpoint_t *endpoint, void *rem_ptr,
> size_t size, int flags, 
> void **local_ptr);
> 
> 
> 
> ---
> 
> Summary of changes:
> opal/mca/btl/vader/btl_vader_component.c |  9 ++---
> opal/mca/btl/vader/btl_vader_xpmem.c | 13 +
> opal/mca/btl/vader/btl_vader_xpmem.h |  3 +++
> 3 files changed, 18 insertions(+), 7 deletions(-)
> 
> 
> hooks/post-receive
> -- 
> open-mpi/ompi
> ___
> ompi-commits mailing list
> ompi-comm...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain

I suspect we’ll just remove it, but I want to give the other developers a 
chance to chime in before doing so.

> On Dec 12, 2014, at 6:07 PM, Paul Hargrove  wrote:
> 
> Ralph,
> 
> If preserved at all, the existing code should probably be made to act more 
> intelligently when it encounters an unknown escape code.  I would suggest 
> advancing the length by some value (say 128?) that should be "big enough" and 
> printing a prominent warning.  So, the next time this bug surfaces it will be 
> (a) non-fatal and (b) easy to pin down.
> 
> -Paul
> 
> On Fri, Dec 12, 2014 at 5:46 PM, Ralph Castain wrote:
> Looking at the comments in the code, it appears that the rationale when 
> written was to provide support for REALLY ancient systems that didn’t have 
> some of these functions. Since that time, we added a configure check for 
> vsnprintf, so I’m adding Paul/Larry’s suggested code, protected by that 
> configure.
> 
> Since I suspect the configure check will always pass on any system of 
> interest today, I think this will solve the problem. We can then address the 
> broader question (e.g., do we even need this stuff any more at all?) in a 
> more leisurely way.
> 
> 
>> On Dec 12, 2014, at 5:42 PM, Larry Baker wrote:
>> 
>> On 12 Dec 2014, at 5:22 PM, Paul Hargrove wrote:
>> 
>>> HOWEVER, while the patch catches the "%u" case, there are plenty of 
>>> potential ways to hit the same problem if, for instance, one uses "%zu" for 
>>> size_t.  Additionally, I've already noted that the code for "%ld", "%lx", 
>>> "%lX", "%lf" are all currently incorrect.
>> 
>> 
>> Not sure if it is applicable, but C99 has an <inttypes.h> header which 
>> #include's <stdint.h> and provides additional capabilities, such as 
>> printf()/scanf() format macros for the types defined in <stdint.h>.
>> 
>> Larry Baker
>> US Geological Survey
>> 650-329-5608 
>> ba...@usgs.gov 
>> 
> 
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov 
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16578.php

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Paul Hargrove

Ralph,

The Solaris-11/x86-64 system is now "all good" with those changes.
Works with "-mca oob_tcp_if_include bge0", "-mca oob_tcp_if_exclude bge0"
and with neither.

I will next check whether this fixes the interrupted select warnings seen on
Solaris-10/SPARC.

-Paul

On Fri, Dec 12, 2014 at 5:17 PM, Ralph Castain  wrote:

> No need for autogen - simple change to a couple of files
>
>
>
> On Dec 12, 2014, at 4:38 PM, Paul Hargrove  wrote:
>
> Ralph,
>
> Patches to *code* are fine, but I am not equipped to autogen.
>
> -Paul
>
> On Fri, Dec 12, 2014 at 4:37 PM, Ralph Castain  wrote:
>
>> Would you be open to a patch you can test instead of me rolling an rc?
>> I'd be happy to send one in a while
>>
>> On Dec 12, 2014, at 4:34 PM, Ralph Castain  wrote:
>>
>> I'm hoping it will fix it. The timeout code was the only change from
>> 1.8.3 besides the loopback warning, so it should restore the prior behavior.
>>
>>
>> On Dec 12, 2014, at 4:32 PM, Paul Hargrove  wrote:
>>
>>
>> On Fri, Dec 12, 2014 at 4:29 PM, Ralph Castain  wrote:
>>
>>> All right - I'll surrender and remove the timeout. Will release rc4
>>> later tonight.
>>>
>>> Sorry for putting you thru this Paul - for some reason, these problems
>>> aren't showing up elsewhere.
>>>
>>
>> Even at a 300s timeout I don't get a connection.
>> Is rc4 expected to fix that, or are we still "fishing"?
>>
>> -Paul
>>
>>
>> --
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>  ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16568.php
>>
>>
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16570.php
>>
>
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>  ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16571.php
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16572.php
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Paul Hargrove

Ralph,

If preserved at all, the existing code should probably be made to act more
intelligently when it encounters an unknown escape code.  I would suggest
advancing the length by some value (say 128?) that should be "big enough"
and printing a prominent warning.  So, the next time this bug surfaces it
will be (a) non-fatal and (b) easy to pin down.
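
A minimal sketch of the fallback suggested above, for illustration only. It is not the Open MPI guess_strlen() code: `estimate_len`, `estimate` and `UNKNOWN_CONV_PAD` are hypothetical names, and only %d, %u, %s and %% are recognized. One detail is an assumption of this sketch rather than something stated in the thread: after the first unknown conversion the argument list can no longer be trusted, so the sketch stops calling va_arg() and pads every later conversion as well.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define UNKNOWN_CONV_PAD 128   /* the "big enough" guess from the message */

/* Hypothetical estimator: returns an upper-bound guess for the buffer
 * size needed to format fmt, including the trailing '\0'. */
static int estimate_len(const char *fmt, va_list ap)
{
    int len = (int)strlen(fmt) + 1;   /* literal text plus '\0' */
    int trusted = 1;                  /* is ap still in step with fmt? */
    const char *p;

    for (p = fmt; '\0' != *p; ++p) {
        if ('%' != *p) {
            continue;
        }
        ++p;
        if ('\0' == *p) {
            break;                    /* lone '%' at end of format */
        }
        if ('%' == *p) {
            continue;                 /* "%%" consumes no argument */
        }
        if (trusted && 'd' == *p) {
            (void)va_arg(ap, int);
            len += 12;
        } else if (trusted && 'u' == *p) {
            (void)va_arg(ap, unsigned int);
            len += 12;
        } else if (trusted && 's' == *p) {
            const char *s = va_arg(ap, const char *);
            len += (NULL != s) ? (int)strlen(s) : 6;   /* "(null)" */
        } else {
            if (trusted) {
                fprintf(stderr, "WARNING: unknown conversion '%%%c' in "
                        "\"%s\"; padding estimate by %d bytes\n",
                        *p, fmt, UNKNOWN_CONV_PAD);
                trusted = 0;          /* stop reading arguments */
            }
            len += UNKNOWN_CONV_PAD;
        }
    }
    return len;
}

/* Variadic front end so the estimator can be called directly. */
static int estimate(const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = estimate_len(fmt, ap);
    va_end(ap);
    return n;
}
```

With this approach a format such as "%zu:%s" yields a warning and a padded estimate instead of a strlen() call on a misread argument. The pad is still only a guess, so a long string after an unknown conversion could exceed it.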

-Paul

On Fri, Dec 12, 2014 at 5:46 PM, Ralph Castain  wrote:

> Looking at the comments in the code, it appears that the rationale when
> written was to provide support for REALLY ancient systems that didn't have
> some of these functions. Since that time, we added a configure check for
> vsnprintf, so I'm adding Paul/Larry's suggested code, protected by that
> configure.
>
> Since I suspect the configure check will always pass on any system of
> interest today, I think this will solve the problem. We can then address
> the broader question (e.g., do we even need this stuff any more at all?) in
> a more leisurely way.
>
>
> On Dec 12, 2014, at 5:42 PM, Larry Baker  wrote:
>
> On 12 Dec 2014, at 5:22 PM, Paul Hargrove wrote:
>
> HOWEVER, while the patch catches the "%u" case, there are plenty of
> potential ways to hit the same problem if, for instance, one uses "%zu" for
> size_t.  Additionally, I've already noted that the code for "%ld", "%lx",
> "%lX", "%lf" are all currently incorrect.
>
>
> Not sure if it is applicable, but C99 has an <inttypes.h> header which
> #include's <stdint.h> and provides additional capabilities, such as
> printf()/scanf() format macros for the types defined in <stdint.h>.
>
> Larry Baker
> US Geological Survey
> 650-329-5608
> ba...@usgs.gov
>
>
>


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain

Looking at the comments in the code, it appears that the rationale when written 
was to provide support for REALLY ancient systems that didn’t have some of 
these functions. Since that time, we added a configure check for vsnprintf, so 
I’m adding Paul/Larry’s suggested code, protected by that configure.

Since I suspect the configure check will always pass on any system of interest 
today, I think this will solve the problem. We can then address the broader 
question (e.g., do we even need this stuff any more at all?) in a more 
leisurely way.

> On Dec 12, 2014, at 5:42 PM, Larry Baker  wrote:
> 
> On 12 Dec 2014, at 5:22 PM, Paul Hargrove wrote:
> 
>> HOWEVER, while the patch catches the "%u" case, there are plenty of 
>> potential ways to hit the same problem if, for instance, one uses "%zu" for 
>> size_t.  Additionally, I've already noted that the code for "%ld", "%lx", 
>> "%lX", "%lf" are all currently incorrect.
> 
> 
> Not sure if it is applicable, but C99 has an <inttypes.h> header which 
> #include's <stdint.h> and provides additional capabilities, such as 
> printf()/scanf() format macros for the types defined in <stdint.h>.
> 
> Larry Baker
> US Geological Survey
> 650-329-5608
> ba...@usgs.gov 
>

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Larry Baker

On 12 Dec 2014, at 5:22 PM, Paul Hargrove wrote:

> HOWEVER, while the patch catches the "%u" case, there are plenty of potential 
> ways to hit the same problem if, for instance, one uses "%zu" for size_t.  
> Additionally, I've already noted that the code for "%ld", "%lx", "%lX", "%lf" 
> are all currently incorrect.

Not sure if it is applicable, but C99 has an <inttypes.h> header which 
#include's <stdint.h> and provides additional capabilities, such as 
printf()/scanf() format macros for the types defined in <stdint.h>.
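
As a small illustration of the C99 format macros mentioned above, the sketch below formats a uint64_t with PRIu64 from <inttypes.h> and a size_t with "%zu". `format_counts` is a hypothetical helper, not an Open MPI function.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format a 64-bit counter and a size_t portably.
 * Returns what snprintf() returns (the length that was needed). */
static int format_counts(char *buf, size_t buflen,
                         uint64_t count, size_t bytes)
{
    /* PRIu64 expands to the conversion (with the right length modifier)
     * for uint64_t on this platform; %zu is the C99 conversion for size_t. */
    return snprintf(buf, buflen, "%" PRIu64 " msgs, %zu bytes",
                    count, bytes);
}
```

The macros help callers write portable format strings. A hand-written format scanner would still have to recognize whatever length modifiers they expand to.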

Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Larry Baker

Or, slightly modified using a defensive coding style:

>   return 1 + vsnprintf(dummy, sizeof( dummy ), fmt, ap);

If you like sizeof() [which I prefer].  If you like sizeof:

>   return 1 + vsnprintf(dummy, sizeof dummy, fmt, ap);
> 


Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov



On 12 Dec 2014, at 5:22 PM, Paul Hargrove wrote:

> OK, applying my attached patch (based on Gilles's observation) resolved the 
> problem!
> So I fully expect Ralph's plan to use "%d" to also resolve this.
> 
> HOWEVER, while the patch catches the "%u" case, there are plenty of potential 
> ways to hit the same problem if, for instance, one uses "%zu" for size_t.  
> Additionally, I've already noted that the code for "%ld", "%lx", "%lX", "%lf" 
> are all currently incorrect.
> 
> So, I ask: "Why isn't guess_strlen() just implemented as follows?"
> 
> /* From man vsnprintf:
>  *The functions snprintf and vsnprintf do not write more  than
>  * size  bytes (including the trailing '\0').  If the output was truncated
>  * due to this limit then the return value is  the  number  of  characters
>  * (not  including the trailing '\0') which would have been written to the
>  * final string if enough space had been available. 
>  */
> static int guess_strlen(const char *fmt, va_list ap)
> { 
>   char dummy[1];
>   return 1 + vsnprintf(dummy, 1, fmt, ap);
> }
> 
> 
> BTW: I do see some messages like "select: Interrupted system call" which I 
> assume are related to the timeout code (and thus the subject of a different 
> thread).
> 
> 
> -Paul 
> 
> On Fri, Dec 12, 2014 at 3:14 PM, Paul Hargrove  wrote:
> Thanks, Gilles!
> 
> I was looking at that same code just now and completely missed the lack of a 
> case for '%u' (and '%lu').  I will add one now and see if that resolves the 
> problem
> 
> 
> -Paul
> 
> On Fri, Dec 12, 2014 at 3:10 PM, Gilles Gouaillardet 
>  wrote:
> Ralph,
> 
> I cannot find a case for the %u format in guess_strlen.
> And since the default case does not invoke va_arg(),
> it seems strlen is invoked on nnuma instead of arch.
> 
> Makes sense ?
> 
> Cheers,
> 
> Gilles
> 
> Ralph Castain  wrote:
> Afraid I’m drawing a blank, Paul - I can’t see how we got to a bad address 
> down there. This is at the beginning of orte_init, so there are no threads 
> running nor has anything much happened.
> 
> Do you have any suggestions?
> 
> 
>> On Dec 12, 2014, at 9:02 AM, Paul Hargrove  wrote:
>> 
>> Ralph,
>> 
>> The "arch" variable looks fine:
>> Current function is opal_hwloc_base_get_topo_signature
>>  2134nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
>> (dbx) print arch
>> arch = 0x1001700a0 "sun4v"
>> 
>> And so is "fmt":
>> 
>> Current function is opal_asprintf
>>   194   length = opal_vasprintf(ptr, fmt, ap);
>> (dbx) print fmt
>> fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>> 
>> However, things have gone bad in guess_strlen():
>> 
>> Current function is guess_strlen
>>71   len += (int)strlen(sarg);
>> (dbx) print sarg
>> sarg = 0x2 ""
>> 
>> -Paul
>> 
>> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain  wrote:
>> Hmmm….this is really odd. I actually do have a protection for that arch 
>> value being NULL, and you are in the code section when it isn’t.
>> 
>> Do you still have the core file around? If so, can you print out the value 
>> of the “arch” variable? It would be in the 
>> opal_hwloc_base_get_topo_signature level.
>> 
>> I’m wondering if that value has been hosed, and the problem is memory 
>> corruption somewhere.
>> 
>> 
>>> On Dec 11, 2014, at 8:56 PM, Ralph Castain  wrote:
>>> 
>>> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t 
>>> returning an architecture type for some reason, and I didn’t protect 
>>> against it.
>>> 
>>> 
>>>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove  wrote:
>>>> 
>>>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>>>> I've changed the subject line to distinguish this from the earlier report.
>>>> 
>>>> -Paul
>>>> 
>>>> program terminated by signal SEGV (no mapping at the fault address)
>>>> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
>>>> Current function is guess_strlen
>>>>    71   len += (int)strlen(sarg);
>>>> (dbx) where
>>>>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at 
>>>> 0x7d93b634 
>>>> =>[2] guess_strlen(fmt = 0x7eeada98 
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in 
>>>> "printf.c"
>>>>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in 
>>>> "printf.c"
>>>>   [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in 
>>>> "printf.c"
>>>>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in 
>>>> "hwloc_base_util.c"
>>

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain

It’s a fair question - that code is ancient, however, so I’m surprised it has 
only surfaced now as a problem. I can take a look at making the change


> On Dec 12, 2014, at 5:22 PM, Paul Hargrove  wrote:
> 
> OK, applying my attached patch (based on Gilles's observation) resolved the 
> problem!
> So I fully expect Ralph's plan to use "%d" to also resolve this.
> 
> HOWEVER, while the patch catches the "%u" case, there are plenty of potential 
> ways to hit the same problem if, for instance, one uses "%zu" for size_t.  
> Additionally, I've already noted that the code for "%ld", "%lx", "%lX", "%lf" 
> are all currently incorrect.
> 
> So, I ask: "Why isn't guess_strlen() just implemented as follows?"
> 
> /* From man vsnprintf:
>  *The functions snprintf and vsnprintf do not write more  than
>  * size  bytes (including the trailing '\0').  If the output was truncated
>  * due to this limit then the return value is  the  number  of  characters
>  * (not  including the trailing '\0') which would have been written to the
>  * final string if enough space had been available. 
>  */
> static int guess_strlen(const char *fmt, va_list ap)
> { 
>   char dummy[1];
>   return 1 + vsnprintf(dummy, 1, fmt, ap);
> }
> 
> 
> BTW: I do see some messages like "select: Interrupted system call" which I 
> assume are related to the timeout code (and thus the subject of a different 
> thread).
> 
> 
> -Paul 
> 
> On Fri, Dec 12, 2014 at 3:14 PM, Paul Hargrove wrote:
> Thanks, Gilles!
> 
> I was looking at that same code just now and completely missed the lack of a 
> case for '%u' (and '%lu').  I will add one now and see if that resolves the 
> problem
> 
> 
> -Paul
> 
> On Fri, Dec 12, 2014 at 3:10 PM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
> Ralph,
> 
> I cannot find a case for the %u format in guess_strlen.
> And since the default case does not invoke va_arg(),
> it seems strlen is invoked on nnuma instead of arch.
> 
> Makes sense ?
> 
> Cheers,
> 
> Gilles
> 
> Ralph Castain <r...@open-mpi.org> wrote:
> Afraid I’m drawing a blank, Paul - I can’t see how we got to a bad address 
> down there. This is at the beginning of orte_init, so there are no threads 
> running nor has anything much happened.
> 
> Do you have any suggestions?
> 
> 
>> On Dec 12, 2014, at 9:02 AM, Paul Hargrove wrote:
>> 
>> Ralph,
>> 
>> The "arch" variable looks fine:
>> Current function is opal_hwloc_base_get_topo_signature
>>  2134nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
>> (dbx) print arch
>> arch = 0x1001700a0 "sun4v"
>> 
>> And so is "fmt":
>> 
>> Current function is opal_asprintf
>>   194   length = opal_vasprintf(ptr, fmt, ap);
>> (dbx) print fmt
>> fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>> 
>> However, things have gone bad in guess_strlen():
>> 
>> Current function is guess_strlen
>>71   len += (int)strlen(sarg);
>> (dbx) print sarg
>> sarg = 0x2 ""
>> 
>> -Paul
>> 
>> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain wrote:
>> Hmmm….this is really odd. I actually do have a protection for that arch 
>> value being NULL, and you are in the code section when it isn’t.
>> 
>> Do you still have the core file around? If so, can you print out the value 
>> of the “arch” variable? It would be in the 
>> opal_hwloc_base_get_topo_signature level.
>> 
>> I’m wondering if that value has been hosed, and the problem is memory 
>> corruption somewhere.
>> 
>> 
>>> On Dec 11, 2014, at 8:56 PM, Ralph Castain wrote:
>>> 
>>> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t 
>>> returning an architecture type for some reason, and I didn’t protect 
>>> against it.
>>> 
>>> 
>>>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove wrote:
>>>> 
>>>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>>>> I've changed the subject line to distinguish this from the earlier report.
>>>> 
>>>> -Paul
>>>> 
>>>> program terminated by signal SEGV (no mapping at the fault address)
>>>> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
>>>> Current function is guess_strlen
>>>>    71   len += (int)strlen(sarg);
>>>> (dbx) where
>>>>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at 
>>>> 0x7d93b634 
>>>> =>[2] guess_strlen(fmt = 0x7eeada98 
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in 
>>>> "printf.c"
>>>>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in 
>>>> "printf.c"
>>>>   [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in 
>>>> "printf.c"
>>>>   [5] opal_hwloc_base_get_topo_s

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Paul Hargrove

OK, applying my attached patch (based on Gilles's observation) resolved the
problem!
So I fully expect Ralph's plan to use "%d" to also resolve this.

HOWEVER, while the patch catches the "%u" case, there are plenty of
potential ways to hit the same problem if, for instance, one uses "%zu" for
size_t.  Additionally, I've already noted that the handling of "%ld", "%lx",
"%lX", and "%lf" is currently incorrect.

So, I ask: "Why isn't guess_strlen() just implemented as follows?"

/* From man vsnprintf:
 *The functions snprintf and vsnprintf do not write more  than
 * size  bytes (including the trailing '\0').  If the output was truncated
 * due to this limit then the return value is  the  number  of  characters
 * (not  including the trailing '\0') which would have been written to the
 * final string if enough space had been available.
 */
static int guess_strlen(const char *fmt, va_list ap)
{
  char dummy[1];
  return 1 + vsnprintf(dummy, 1, fmt, ap);
}
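
A minimal sketch of how a caller might use the vsnprintf()-based length computation proposed above, assuming a C99-conforming vsnprintf() and va_copy(). `fmt_length` and `format_alloc` are hypothetical names, not Open MPI functions. The va_copy() is there because a va_list should not be reused after vsnprintf() has consumed it.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: bytes needed for the formatted string, including
 * the trailing '\0'. Returns -1 if vsnprintf() reports an error. */
static int fmt_length(const char *fmt, va_list ap)
{
    char dummy[1];
    va_list copy;
    int n;

    va_copy(copy, ap);              /* leave the caller's va_list untouched */
    n = vsnprintf(dummy, sizeof(dummy), fmt, copy);
    va_end(copy);
    return (n < 0) ? -1 : n + 1;
}

/* Hypothetical asprintf-style wrapper built on fmt_length().
 * Returns the string length, or -1 on failure (*out is then NULL). */
static int format_alloc(char **out, const char *fmt, ...)
{
    va_list ap;
    int len;

    *out = NULL;
    va_start(ap, fmt);
    len = fmt_length(fmt, ap);
    if (len > 0) {
        *out = malloc((size_t)len);
        if (NULL != *out) {
            vsnprintf(*out, (size_t)len, fmt, ap);
        } else {
            len = -1;               /* allocation failed */
        }
    }
    va_end(ap);
    return (len < 0) ? -1 : len - 1;
}
```

Because the C library does the parsing, conversions such as "%u" or "%zu" need no special cases here.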



BTW: I do see some messages like "select: Interrupted system call" which I
assume are related to the timeout code (and thus the subject of a different
thread).


-Paul

On Fri, Dec 12, 2014 at 3:14 PM, Paul Hargrove  wrote:

> Thanks, Gilles!
>
> I was looking at that same code just now and completely missed the lack of
> a case for '%u' (and '%lu').  I will add one now and see if that resolves
> the problem
>
>
> -Paul
>
> On Fri, Dec 12, 2014 at 3:10 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Ralph,
>>
>> I cannot find a case for the %u format in guess_strlen.
>> And since the default case does not invoke va_arg(),
>> it seems strlen is invoked on nnuma instead of arch.
>>
>> Makes sense ?
>>
>> Cheers,
>>
>> Gilles
>>
>> Ralph Castain  wrote:
>> Afraid I'm drawing a blank, Paul - I can't see how we got to a bad
>> address down there. This is at the beginning of orte_init, so there are no
>> threads running nor has anything much happened.
>>
>> Do you have any suggestions?
>>
>>
>> On Dec 12, 2014, at 9:02 AM, Paul Hargrove  wrote:
>>
>> Ralph,
>>
>> The "arch" variable looks fine:
>> Current function is opal_hwloc_base_get_topo_signature
>>  2134nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt,
>> arch);
>> (dbx) print arch
>> arch = 0x1001700a0 "sun4v"
>>
>> And so is "fmt":
>>
>> Current function is opal_asprintf
>>   194   length = opal_vasprintf(ptr, fmt, ap);
>> (dbx) print fmt
>> fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>>
>> However, things have gone bad in guess_strlen():
>>
>> Current function is guess_strlen
>>71   len += (int)strlen(sarg);
>> (dbx) print sarg
>> sarg = 0x2 ""
>>
>> -Paul
>>
>> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain  wrote:
>>
>>> Hmmm... this is really odd. I actually do have a protection for that arch
>>> value being NULL, and you are in the code section when it isn't.
>>>
>>> Do you still have the core file around? If so, can you print out the
>>> value of the "arch" variable? It would be in the
>>> opal_hwloc_base_get_topo_signature level.
>>>
>>> I'm wondering if that value has been hosed, and the problem is memory
>>> corruption somewhere.
>>>
>>>
>>> On Dec 11, 2014, at 8:56 PM, Ralph Castain  wrote:
>>>
>>> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc
>>> isn't returning an architecture type for some reason, and I didn't protect
>>> against it.
>>>
>>>
>>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove  wrote:
>>>
>>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>>> I've changed the subject line to distinguish this from the earlier
>>> report.
>>>
>>> -Paul
>>>
>>> program terminated by signal SEGV (no mapping at the fault address)
>>> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
>>> Current function is guess_strlen
>>>71   len += (int)strlen(sarg);
>>> (dbx) where
>>>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at
>>> 0x7d93b634
>>> =>[2] guess_strlen(fmt = 0x7eeada98
>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in
>>> "printf.c"
>>>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98
>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in
>>> "printf.c"
>>>   [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98
>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in
>>> "printf.c"
>>>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134
>>> in "hwloc_base_util.c"
>>>   [6] rte_init(), line 205 in "ess_hnp_module.c"
>>>   [7] orte_init(pargc = 0x761c, pargv = 0x7610,
>>> flags = 4U), line 148 in "orte_init.c"
>>>   [8] orterun(argc = 7, argv = 0x77a8), line 856 in
>>> "orterun.c"
>>>   [9] main(argc = 7, argv = 0x77a8), line 13 in "main.c"
>>>
>>> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain  wrote:
>>>
 No, that looks differe

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Ralph Castain

No need for autogen - simple change to a couple of files

paul.diff
Description: Binary data
On Dec 12, 2014, at 4:38 PM, Paul Hargrove  wrote:

> Ralph,
>
> Patches to code are fine, but I am not equipped to autogen.
>
> -Paul
>
> On Fri, Dec 12, 2014 at 4:37 PM, Ralph Castain  wrote:
>
>> Would you be open to a patch you can test instead of me rolling an rc?
>> I’d be happy to send one in a while
>>
>> On Dec 12, 2014, at 4:34 PM, Ralph Castain  wrote:
>>
>> I’m hoping it will fix it. The timeout code was the only change from
>> 1.8.3 besides the loopback warning, so it should restore the prior behavior.
>>
>> On Dec 12, 2014, at 4:32 PM, Paul Hargrove  wrote:
>>
>> On Fri, Dec 12, 2014 at 4:29 PM, Ralph Castain  wrote:
>>
>>> All right - I’ll surrender and remove the timeout. Will release rc4
>>> later tonight.
>>>
>>> Sorry for putting you thru this Paul - for some reason, these problems
>>> aren’t showing up elsewhere.
>>
>> Even at a 300s timeout I don't get a connection.
>> Is rc4 expected to fix that, or are we still "fishing"?
>>
>> -Paul
>>
>> --
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16568.php
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16570.php
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16571.php

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Paul Hargrove

Ralph,

Patches to *code* are fine, but I am not equipped to autogen.

-Paul

On Fri, Dec 12, 2014 at 4:37 PM, Ralph Castain  wrote:

> Would you be open to a patch you can test instead of me rolling an rc? I'd
> be happy to send one in a while
>
> On Dec 12, 2014, at 4:34 PM, Ralph Castain  wrote:
>
> I'm hoping it will fix it. The timeout code was the only change from 1.8.3
> besides the loopback warning, so it should restore the prior behavior.
>
>
> On Dec 12, 2014, at 4:32 PM, Paul Hargrove  wrote:
>
>
> On Fri, Dec 12, 2014 at 4:29 PM, Ralph Castain  wrote:
>
>> All right - I'll surrender and remove the timeout. Will release rc4 later
>> tonight.
>>
>> Sorry for putting you thru this Paul - for some reason, these problems
>> aren't showing up elsewhere.
>>
>
> Even at a 300s timeout I don't get a connection.
> Is rc4 expected to fix that, or are we still "fishing"?
>
> -Paul
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>  ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16568.php
>
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16570.php
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Ralph Castain

Would you be open to a patch you can test instead of me rolling an rc? I’d be 
happy to send one in a while

> On Dec 12, 2014, at 4:34 PM, Ralph Castain  wrote:
> 
> I’m hoping it will fix it. The timeout code was the only change from 1.8.3 
> besides the loopback warning, so it should restore the prior behavior.
> 
> 
>> On Dec 12, 2014, at 4:32 PM, Paul Hargrove wrote:
>> 
>> 
>> On Fri, Dec 12, 2014 at 4:29 PM, Ralph Castain wrote:
>> All right - I’ll surrender and remove the timeout. Will release rc4 later 
>> tonight.
>> 
>> Sorry for putting you thru this Paul - for some reason, these problems 
>> aren’t showing up elsewhere.
>> 
>> Even at a 300s timeout I don't get a connection.
>> Is rc4 expected to fix that, or are we still "fishing"?
>> 
>> -Paul
>> 
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov 
>> 
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> de...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16568.php
>

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Ralph Castain

I’m hoping it will fix it. The timeout code was the only change from 1.8.3 
besides the loopback warning, so it should restore the prior behavior.


> On Dec 12, 2014, at 4:32 PM, Paul Hargrove  wrote:
> 
> 
> On Fri, Dec 12, 2014 at 4:29 PM, Ralph Castain wrote:
> All right - I’ll surrender and remove the timeout. Will release rc4 later 
> tonight.
> 
> Sorry for putting you thru this Paul - for some reason, these problems aren’t 
> showing up elsewhere.
> 
> Even at a 300s timeout I don't get a connection.
> Is rc4 expected to fix that, or are we still "fishing"?
> 
> -Paul
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov 
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16568.php

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Paul Hargrove

On Fri, Dec 12, 2014 at 4:29 PM, Ralph Castain  wrote:

> All right - I'll surrender and remove the timeout. Will release rc4 later
> tonight.
>
> Sorry for putting you thru this Paul - for some reason, these problems
> aren't showing up elsewhere.
>

Even at a 300s timeout I don't get a connection.
Is rc4 expected to fix that, or are we still "fishing"?

-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain

Crud - sorry for delayed response. I was out for a bit.

I’ll just change it to %d as there is nothing magic about it being unsigned. 
How bizarre.


> On Dec 12, 2014, at 3:21 PM, Paul Hargrove  wrote:
> 
> NOTE:
> 
> The existing code for "%l." in guess_strlen() is garbage.
> The va_arg() macro calls all have "int" for the type!!
> 
> I am *only* testing a fix for the missing "%u" at the moment.
> 
> -Paul
> 
> On Fri, Dec 12, 2014 at 3:14 PM, Paul Hargrove wrote:
> Thanks, Gilles!
> 
> I was looking at that same code just now and completely missed the lack of a 
> case for '%u' (and '%lu').  I will add one now and see if that resolves the 
> problem
> 
> 
> -Paul
> 
> On Fri, Dec 12, 2014 at 3:10 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> Ralph,
> 
> I cannot find a case for the %u format in guess_strlen
> And since the default does not invoke va_arg()
> it seems strlen is invoked on nnuma instead of arch
> 
> Makes sense ?
> 
> Cheers,
> 
> Gilles
> 
> Ralph Castain <r...@open-mpi.org> wrote:
> Afraid I’m drawing a blank, Paul - I can’t see how we got to a bad address 
> down there. This is at the beginning of orte_init, so there are no threads 
> running nor has anything much happened.
> 
> Do you have any suggestions?
> 
> 
>> On Dec 12, 2014, at 9:02 AM, Paul Hargrove wrote:
>> 
>> Ralph,
>> 
>> The "arch" variable looks fine:
>> Current function is opal_hwloc_base_get_topo_signature
>>  2134                    nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
>> (dbx) print arch
>> arch = 0x1001700a0 "sun4v"
>> 
>> And so is "fmt":
>> 
>> Current function is opal_asprintf
>>   194   length = opal_vasprintf(ptr, fmt, ap);
>> (dbx) print fmt
>> fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>> 
>> However, things have gone bad in guess_strlen():
>> 
>> Current function is guess_strlen
>>    71                       len += (int)strlen(sarg);
>> (dbx) print sarg
>> sarg = 0x2 ""
>> 
>> -Paul
>> 
>> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain wrote:
>> Hmmm….this is really odd. I actually do have a protection for that arch 
>> value being NULL, and you are in the code section when it isn’t.
>> 
>> Do you still have the core file around? If so, can you print out the value 
>> of the “arch” variable? It would be in the 
>> opal_hwloc_base_get_topo_signature level.
>> 
>> I’m wondering if that value has been hosed, and the problem is memory 
>> corruption somewhere.
>> 
>> 
>>> On Dec 11, 2014, at 8:56 PM, Ralph Castain wrote:
>>> 
>>> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t 
>>> returning an architecture type for some reason, and I didn’t protect 
>>> against it.
>>> 
>>> 
>>>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove wrote:
>>>>
>>>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>>>> I've changed the subject line to distinguish this from the earlier report.
>>>>
>>>> -Paul
>>>>
>>>> program terminated by signal SEGV (no mapping at the fault address)
>>>> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
>>>> Current function is guess_strlen
>>>>    71                       len += (int)strlen(sarg);
>>>> (dbx) where
>>>>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at
>>>> 0x7d93b634
>>>> =>[2] guess_strlen(fmt = 0x7eeada98
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in
>>>> "printf.c"
>>>>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in
>>>> "printf.c"
>>>>   [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in
>>>> "printf.c"
>>>>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in
>>>> "hwloc_base_util.c"
>>>>   [6] rte_init(), line 205 in "ess_hnp_module.c"
>>>>   [7] orte_init(pargc = 0x761c, pargv = 0x7610,
>>>> flags = 4U), line 148 in "orte_init.c"
>>>>   [8] orterun(argc = 7, argv = 0x77a8), line 856 in "orterun.c"
>>>>   [9] main(argc = 7, argv = 0x77a8), line 13 in "main.c"
>>>>
>>>> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain wrote:
>>>> No, that looks different - it’s failing in mpirun itself. Can you get a
>>>> line number on it?
>>>>
>>>> Sorry for delay - I’m generating rc3 now
>>>>
>>>>
>>>>> On Dec 11, 2014, at 6:59 PM, Paul Hargrove wrote:
>>>>>
>>>>> Don't see an rc3 yet.
>>>>>
>>>>> My Solaris-10/SPARC runs fail slightly differently (see below).
>>>>> It looks sufficiently similar that it MIGHT be the same root cause.
>>>>> However, lacking an rc3 to test I figured it would be better to repo

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Ralph Castain

All right - I’ll surrender and remove the timeout. Will release rc4 later 
tonight.

Sorry for putting you thru this Paul - for some reason, these problems aren’t 
showing up elsewhere.


> On Dec 12, 2014, at 3:37 PM, Paul Hargrove  wrote:
> 
> 
> 
> On Fri, Dec 12, 2014 at 2:58 PM, Ralph Castain wrote:
>> Aha! You are the first to fall thru the timeout. How interesting.
> 
> When it comes to the release candidates, I seem to own a lot of "firsts".
> It is not as fun as one might imagine :-).
> 
>> Can you please try adding “-mca oob_tcp_connect_timeout 5:0”?
> 
> That appeared to produce a timeout of about 5 SECONDS ("time mpirun" reports 
> 5.8s elapsed).  Was that really the intent?   No difference if I change "5:0" 
> to "5:00".  So, you might have an "extra" bug lurking there.
> 
> 
> New stderr attached for
>   $ mpirun -mca oob_tcp_if_include bge0 -mca oob_tcp_connect_timeout 5:0 -mca 
> oob_base_verbose 20 -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 
> examples/ring_c
> 
> Assuming "5:0" was intended to get a 5 MINUTE timeout, I also tried "-mca 
> oob_tcp_connect_timeout 300", and have also attached the resulting stderr.
> 
> No joy for either timeout value.
> 
> -Paul
> 
>  
> 
> On Dec 12, 2014, at 8:53 AM, Paul Hargrove wrote:
>> 
>> 
>> First, I want to ask what became of the issue discussed in this thread?
>>http://www.open-mpi.org/community/lists/devel/2014/11/16160.php 
>> 
>> I thought we had concluded that one just needed -D_REENTRANT.
>> I mention that only for completeness, because I think my current problem is 
>> different.
>> 
>> The following works fine with 1.8.3, making the current behavior a 
>> regression.
>> 
>> I am still on the same system as that previous report, and still/again see a 
>> message like the following:
>> 
>> 
>> A process or daemon was unable to complete a TCP connection
>> to another process:
>>   Local host:pcp-j-19
>>   Remote host:   172.18.0.120
>> This is usually caused by a firewall on the remote host. Please
>> check that any firewall (e.g., iptables) has been disabled and
>> try again.
>> 
>> --
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>> [...etc...]
>> 
>> It may be worth noting that the hostname pcp-j-19 (172.16.0.119) and the 
>> address 172.18.0.120 are on different subnets.
>> 
>> I CANNOT resolve the issue this time by adding -D_REENTRANT to CFLAGS at 
>> configure time (I didn't bother to check if it is there by default now or not).
>> 
>> NOR can I resolve it by using "-mca oob_tcp_if_include bge0" to allow only 
>> the 172.16.0.120 subnet.
>> IN FACT, the message is the same with that option, other than "172.18" 
>> changing to "172.16".
>> 
>> I've attached the output generated by "-mca oob_base_verbose 20" both with 
>> and without the oob_tcp_if_include.
>> 
>> I should also note that the following is my full mpirun command, which 
>> excludes the tcp BTL.
>> pcp-j-20$ mpirun -mca oob_tcp_if_include bge0 -mca oob_base_verbose 20 -mca 
>> btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c
>> 
>> 
>> -Paul
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov 
>> 
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352 
>> 
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900 
>> ___
>> devel mailing list
>> de...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
>> 
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16551.php 
>> 
> 
> ___
> devel mailing list
> de...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16561.php 
> 
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov 
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listin

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Paul Hargrove

On Fri, Dec 12, 2014 at 2:58 PM, Ralph Castain  wrote:

> Aha! You are the first to fall thru the timeout. How interesting.
>

When it comes to the release candidates, I seem to own a lot of "firsts".
It is not as fun as one might imagine :-).

> Can you please try adding "-mca oob_tcp_connect_timeout 5:0"?
>

That appeared to produce a timeout of about 5 SECONDS ("time mpirun"
reports 5.8s elapsed).  Was that really the intent?   No difference if I
change "5:0" to "5:00".  So, you might have an "extra" bug lurking there.


New stderr attached for
  $ mpirun -mca oob_tcp_if_include bge0 -mca oob_tcp_connect_timeout 5:0
-mca oob_base_verbose 20 -mca btl sm,self,openib -np 2 -host
pcp-j-19,pcp-j-20 examples/ring_c

Assuming "5:0" was intended to get a 5 MINUTE timeout, I also tried "-mca
oob_tcp_connect_timeout 300", and have also attached the resulting stderr.

No joy for either timeout value.

-Paul
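
A possible explanation for "5:0" and "5:00" both behaving like 5 seconds, offered only as an assumption because this thread does not show how the parameter is parsed: if the value is read as a plain integer, C's strtol() stops at the first character that cannot be part of the number, so the colon and everything after it would be ignored. The helper name parse_seconds below is hypothetical and is not taken from Open MPI.

```c
#include <stdlib.h>

/* Hypothetical helper, not the Open MPI parser: read a decimal integer from
 * the front of text and report where parsing stopped. */
long parse_seconds(const char *text, const char **rest)
{
    char *end = NULL;
    long value = strtol(text, &end, 10);  /* stops at the first non-digit */

    if (rest != NULL) {
        *rest = end;  /* for "5:0" this points at ":0" */
    }
    return value;
}
```

Called on "5:0" or "5:00" the helper returns 5 and leaves the colon and what follows unparsed; called on "300" it returns 300 with nothing left over. That would be consistent with the roughly 5-second timeout reported above, but it does not establish what the real code does.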



>
> On Dec 12, 2014, at 8:53 AM, Paul Hargrove  wrote:
>
>
>
> First, I want to ask what became of the issue discussed in this thread?
>http://www.open-mpi.org/community/lists/devel/2014/11/16160.php
> I thought we had concluded that one just needed -D_REENTRANT.
> I mention that only for completeness, because I think my current problem
> is different.
>
> The following works fine with 1.8.3, making the current behavior a
> regression.
>
> I am still on the same system as that previous report, and still/again see
> a message like the following:
>
> 
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:pcp-j-19
>   Remote host:   172.18.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> 
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> [...etc...]
>
> It may be worth noting that the hostname pcp-j-19 (172.16.0.119) and the
> address 172.18.0.120 are on different subnets.
>
> I CANNOT resolve the issue this time by adding -D_REENTRANT to CFLAGS at
> configure time (I didn't bother to check if it is there by default now or not).
>
> NOR can I resolve it by using "-mca oob_tcp_if_include bge0" to allow only
> the 172.16.0.120 subnet.
> IN FACT, the message is the same with that option, other than "172.18"
> changing to "172.16".
>
> I've attached the output generated by "-mca oob_base_verbose 20" both with
> and without the oob_tcp_if_include.
>
> I should also note that the following is my full mpirun command,
> which excludes the tcp BTL.
> pcp-j-20$ mpirun -mca oob_tcp_if_include bge0 -mca oob_base_verbose 20
> -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c
>
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>  
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16551.php
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16561.php
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
[pcp-j-20:07201] mca: base: components_register: registering oob components
[pcp-j-20:07201] mca: base: components_register: found loaded component tcp
[pcp-j-20:07201] mca: base: components_register: component tcp register 
function successful
[pcp-j-20:07201] mca: base: components_open: opening oob components
[pcp-j-20:07201] mca: base: components_open: found loaded component tcp
[pcp-j-20:07201] mca: base: components_open: component tcp open function 
successful
[pcp-j-20:07201] mca:oob:select: checking available component tcp
[pcp-j-20:07201] mca:oob:select: Querying component [tcp]
[pcp-j-20:07201] oob:tcp: component_available called
[pcp-j-20:07201] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[pcp-j-20:07201] [[32105,0],0] oob:tcp:init rejecting interface lo0 (not in 
include list)
[pcp-j-20:07201] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[pcp-j-20:07201] [[32105,0],0] oob:tcp:init adding 172.16.0.120 to our list of 
V4 connections
[pcp-j-20:07201] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[pcp-j-20:07201] [[32105,0],0] oob:tcp:init rejec

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Paul Hargrove

NOTE:

The existing code for "%l." in guess_strlen() is garbage.
The va_arg() macro calls all have "int" for the type!!

I am *only* testing a fix for the missing "%u" at the moment.

-Paul
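
To illustrate the point about va_arg() types: each conversion has to fetch its argument with the type the caller actually passed (after the default argument promotions), so "%ld" and "%lu" need long and unsigned long rather than int. A minimal sketch follows. The name conversion_width is hypothetical, not a function in printf.c, and the sketch measures a single argument with snprintf() instead of reproducing guess_strlen().

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical illustration, not Open MPI code: return the number of
 * characters needed to print the single argument that follows, where conv is
 * one of "d", "u", "ld" or "lu". Returns -1 for anything else. */
int conversion_width(const char *conv, ...)
{
    va_list ap;
    int width = -1;

    va_start(ap, conv);
    if (strcmp(conv, "d") == 0) {
        width = snprintf(NULL, 0, "%d", va_arg(ap, int));
    } else if (strcmp(conv, "u") == 0) {
        width = snprintf(NULL, 0, "%u", va_arg(ap, unsigned int));
    } else if (strcmp(conv, "ld") == 0) {
        /* the 'l' modifier means the caller passed a long, not an int */
        width = snprintf(NULL, 0, "%ld", va_arg(ap, long));
    } else if (strcmp(conv, "lu") == 0) {
        width = snprintf(NULL, 0, "%lu", va_arg(ap, unsigned long));
    }
    va_end(ap);
    return width;
}
```

On an LP64 ABI, where int is 32 bits and long is 64 bits, fetching a long with va_arg(ap, int) is undefined behavior and can leave the argument list out of step for every conversion that follows.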

On Fri, Dec 12, 2014 at 3:14 PM, Paul Hargrove  wrote:

> Thanks, Gilles!
>
> I was looking at that same code just now and completely missed the lack of
> a case for '%u' (and '%lu').  I will add one now and see if that resolves
> the problem
>
>
> -Paul
>
> On Fri, Dec 12, 2014 at 3:10 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Ralph,
>>
>> I cannot find a case for the %u format in guess_strlen
>> And since the default does not invoke va_arg()
>> it seems strlen is invoked on nnuma instead of arch
>>
>> Makes sense ?
>>
>> Cheers,
>>
>> Gilles
>>
>> Ralph Castain  wrote:
>> Afraid I'm drawing a blank, Paul - I can't see how we got to a bad
>> address down there. This is at the beginning of orte_init, so there are no
>> threads running nor has anything much happened.
>>
>> Do you have any suggestions?
>>
>>
>> On Dec 12, 2014, at 9:02 AM, Paul Hargrove  wrote:
>>
>> Ralph,
>>
>> The "arch" variable looks fine:
>> Current function is opal_hwloc_base_get_topo_signature
>>  2134                    nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
>> (dbx) print arch
>> arch = 0x1001700a0 "sun4v"
>>
>> And so is "fmt":
>>
>> Current function is opal_asprintf
>>   194   length = opal_vasprintf(ptr, fmt, ap);
>> (dbx) print fmt
>> fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>>
>> However, things have gone bad in guess_strlen():
>>
>> Current function is guess_strlen
>>    71                       len += (int)strlen(sarg);
>> (dbx) print sarg
>> sarg = 0x2 ""
>>
>> -Paul
>>
>> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain  wrote:
>>
>>> Hmmm... this is really odd. I actually do have a protection for that arch
>>> value being NULL, and you are in the code section when it isn't.
>>>
>>> Do you still have the core file around? If so, can you print out the
>>> value of the "arch" variable? It would be in the
>>> opal_hwloc_base_get_topo_signature level.
>>>
>>> I'm wondering if that value has been hosed, and the problem is memory
>>> corruption somewhere.
>>>
>>>
>>> On Dec 11, 2014, at 8:56 PM, Ralph Castain  wrote:
>>>
>>> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc
>>> isn't returning an architecture type for some reason, and I didn't protect
>>> against it.
>>>
>>>
>>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove  wrote:
>>>
>>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>>> I've changed the subject line to distinguish this from the earlier
>>> report.
>>>
>>> -Paul
>>>
>>> program terminated by signal SEGV (no mapping at the fault address)
>>> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
>>> Current function is guess_strlen
>>>    71                       len += (int)strlen(sarg);
>>> (dbx) where
>>>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at
>>> 0x7d93b634
>>> =>[2] guess_strlen(fmt = 0x7eeada98
>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in
>>> "printf.c"
>>>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98
>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in
>>> "printf.c"
>>>   [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98
>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in
>>> "printf.c"
>>>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134
>>> in "hwloc_base_util.c"
>>>   [6] rte_init(), line 205 in "ess_hnp_module.c"
>>>   [7] orte_init(pargc = 0x761c, pargv = 0x7610,
>>> flags = 4U), line 148 in "orte_init.c"
>>>   [8] orterun(argc = 7, argv = 0x77a8), line 856 in
>>> "orterun.c"
>>>   [9] main(argc = 7, argv = 0x77a8), line 13 in "main.c"
>>>
>>> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain  wrote:
>>>
>>>> No, that looks different - it's failing in mpirun itself. Can you get a
>>>> line number on it?
>>>>
>>>> Sorry for delay - I'm generating rc3 now
>>>>
>>>>
>>>> On Dec 11, 2014, at 6:59 PM, Paul Hargrove  wrote:
>>>>
>>>> Don't see an rc3 yet.
>>>>
>>>> My Solaris-10/SPARC runs fail slightly differently (see below).
>>>> It looks sufficiently similar that it MIGHT be the same root cause.
>>>> However, lacking an rc3 to test I figured it would be better to report
>>>> this than to ignore it.
>>>>
>>>> The problem is present with both V8+ and V9 ABIs, and with both Gnu and
>>>> Sun compilers.
>>>>
>>>> -Paul
>>>>
>>>> [niagara1:29881] *** Process received signal ***
>>>> [niagara1:29881] Signal: Segmentation Fault (11)
>>>> [niagara1:29881] Signal code: Address not mapped (1)
>>>> [niagara1:29881] Failing at address: 2
>>>>
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Paul Hargrove

Thanks, Gilles!

I was looking at that same code just now and completely missed the lack of
a case for '%u' (and '%lu').  I will add one now and see if that resolves
the problem


-Paul

On Fri, Dec 12, 2014 at 3:10 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Ralph,
>
> I cannot find a case for the %u format in guess_strlen
> And since the default does not invoke va_arg()
> it seems strlen is invoked on nnuma instead of arch
>
> Makes sense ?
>
> Cheers,
>
> Gilles
>
> Ralph Castain  wrote:
> Afraid I'm drawing a blank, Paul - I can't see how we got to a bad address
> down there. This is at the beginning of orte_init, so there are no threads
> running nor has anything much happened.
>
> Do you have any suggestions?
>
>
> On Dec 12, 2014, at 9:02 AM, Paul Hargrove  wrote:
>
> Ralph,
>
> The "arch" variable looks fine:
> Current function is opal_hwloc_base_get_topo_signature
>  2134                    nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
> (dbx) print arch
> arch = 0x1001700a0 "sun4v"
>
> And so is "fmt":
>
> Current function is opal_asprintf
>   194   length = opal_vasprintf(ptr, fmt, ap);
> (dbx) print fmt
> fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>
> However, things have gone bad in guess_strlen():
>
> Current function is guess_strlen
>    71                       len += (int)strlen(sarg);
> (dbx) print sarg
> sarg = 0x2 ""
>
> -Paul
>
> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain  wrote:
>
>> Hmmm... this is really odd. I actually do have a protection for that arch
>> value being NULL, and you are in the code section when it isn't.
>>
>> Do you still have the core file around? If so, can you print out the
>> value of the "arch" variable? It would be in the
>> opal_hwloc_base_get_topo_signature level.
>>
>> I'm wondering if that value has been hosed, and the problem is memory
>> corruption somewhere.
>>
>>
>> On Dec 11, 2014, at 8:56 PM, Ralph Castain  wrote:
>>
>> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn't
>> returning an architecture type for some reason, and I didn't protect
>> against it.
>>
>>
>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove  wrote:
>>
>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>> I've changed the subject line to distinguish this from the earlier report.
>>
>> -Paul
>>
>> program terminated by signal SEGV (no mapping at the fault address)
>> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
>> Current function is guess_strlen
>>    71                       len += (int)strlen(sarg);
>> (dbx) where
>>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at
>> 0x7d93b634
>> =>[2] guess_strlen(fmt = 0x7eeada98
>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in
>> "printf.c"
>>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98
>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in
>> "printf.c"
>>   [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98
>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in
>> "printf.c"
>>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134
>> in "hwloc_base_util.c"
>>   [6] rte_init(), line 205 in "ess_hnp_module.c"
>>   [7] orte_init(pargc = 0x761c, pargv = 0x7610,
>> flags = 4U), line 148 in "orte_init.c"
>>   [8] orterun(argc = 7, argv = 0x77a8), line 856 in
>> "orterun.c"
>>   [9] main(argc = 7, argv = 0x77a8), line 13 in "main.c"
>>
>> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain  wrote:
>>
>>> No, that looks different - it's failing in mpirun itself. Can you get a
>>> line number on it?
>>>
>>> Sorry for delay - I'm generating rc3 now
>>>
>>>
>>> On Dec 11, 2014, at 6:59 PM, Paul Hargrove  wrote:
>>>
>>> Don't see an rc3 yet.
>>>
>>> My Solaris-10/SPARC runs fail slightly differently (see below).
>>> It looks sufficiently similar that it MIGHT be the same root cause.
>>> However, lacking an rc3 to test I figured it would be better to report
>>> this than to ignore it.
>>>
>>> The problem is present with both V8+ and V9 ABIs, and with both Gnu and
>>> Sun compilers.
>>>
>>> -Paul
>>>
>>> [niagara1:29881] *** Process received signal ***
>>> [niagara1:29881] Signal: Segmentation Fault (11)
>>> [niagara1:29881] Signal code: Address not mapped (1)
>>> [niagara1:29881] Failing at address: 2
>>>
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
>>>
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
>>> /lib/libc.so.1:0xc5364
>>> /lib/libc.so.1:0xb9e64
>>> /lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
>>>
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
>>>
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Gilles Gouaillardet

Ralph,

I cannot find a case for the %u format in guess_strlen.
And since the default does not invoke va_arg(),
it seems strlen is invoked on nnuma instead of arch.

Makes sense?

Cheers,

Gilles
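
The diagnosis above can be sketched as follows: a length estimator that walks the format string has to call va_arg() for every conversion that consumes an argument, otherwise a later "%s" reads an earlier integer as if it were a pointer (which would be consistent with sarg = 0x2 in the dbx output). This is a minimal illustration under stated assumptions, not the code in printf.c. The names estimate_len and estimate_len_v are hypothetical, and only a handful of conversions are handled.

```c
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the Open MPI source: compute an upper bound on
 * the formatted length by walking the format string and the argument list
 * in lockstep. */
static size_t estimate_len_v(const char *fmt, va_list ap)
{
    size_t len = strlen(fmt) + 1;  /* literal text plus the terminating NUL */

    for (const char *p = fmt; *p != '\0'; ++p) {
        if (*p != '%') {
            continue;
        }
        ++p;
        if (*p == '\0') {
            break;
        }
        switch (*p) {
        case '%':
            break;  /* "%%" consumes no argument */
        case 's': {
            const char *s = va_arg(ap, const char *);
            len += (s != NULL) ? strlen(s) : strlen("(null)");
            break;
        }
        case 'd':
        case 'i':
            (void)va_arg(ap, int);
            len += 3 * sizeof(int) + 1;  /* at most 3 digits per byte, plus sign */
            break;
        case 'u':
        case 'x':
            /* without this case the argument would stay in the list and the
             * next "%s" would pick it up as a pointer */
            (void)va_arg(ap, unsigned int);
            len += 3 * sizeof(unsigned int);
            break;
        default:
            break;  /* unknown conversion: nothing is consumed */
        }
    }
    return len;
}

size_t estimate_len(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    size_t len = estimate_len_v(fmt, ap);
    va_end(ap);
    return len;
}
```

With the format string from the backtrace, "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", the seven "%u" cases each consume one unsigned argument, so the final "%s" receives the string pointer rather than the first integer.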

Ralph Castain  wrote:
>Afraid I’m drawing a blank, Paul - I can’t see how we got to a bad address 
>down there. This is at the beginning of orte_init, so there are no threads 
>running nor has anything much happened.
>
>
>Do you have any suggestions?
>
>
>
>On Dec 12, 2014, at 9:02 AM, Paul Hargrove  wrote:
>
>
>Ralph,
>
>
>The "arch" variable looks fine:
>
>Current function is opal_hwloc_base_get_topo_signature
>
> 2134                    nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
>
>(dbx) print arch
>
>arch = 0x1001700a0 "sun4v"
>
>
>And so is "fmt":
>
>
>Current function is opal_asprintf
>
>  194       length = opal_vasprintf(ptr, fmt, ap);
>
>(dbx) print fmt
>
>fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>
>
>However, things have gone bad in guess_strlen():
>
>
>Current function is guess_strlen
>
>   71                       len += (int)strlen(sarg);
>
>(dbx) print sarg
>
>sarg = 0x2 ""
>
>
>-Paul
>
>
>On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain  wrote:
>
>Hmmm….this is really odd. I actually do have a protection for that arch value 
>being NULL, and you are in the code section when it isn’t.
>
>
>Do you still have the core file around? If so, can you print out the value of 
>the “arch” variable? It would be in the opal_hwloc_base_get_topo_signature 
>level.
>
>
>I’m wondering if that value has been hosed, and the problem is memory 
>corruption somewhere.
>
>
>
>On Dec 11, 2014, at 8:56 PM, Ralph Castain  wrote:
>
>
>Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t 
>returning an architecture type for some reason, and I didn’t protect against 
>it.
>
>
>
>On Dec 11, 2014, at 7:39 PM, Paul Hargrove  wrote:
>
>
>Backtrace for the Solaris-10/SPARC SEGV appears below.
>
>I've changed the subject line to distinguish this from the earlier report.
>
>
>-Paul
>
>
>program terminated by signal SEGV (no mapping at the fault address)
>
>0x7d93b634: strlen+0x0014:      lduh     [%o2], %o1
>
>Current function is guess_strlen
>
>   71                       len += (int)strlen(sarg);
>
>(dbx) where
>
>  [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at 
>0x7d93b634 
>
>=>[2] guess_strlen(fmt = 0x7eeada98 
>"%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in 
>"printf.c"
>
>  [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>"%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in 
>"printf.c"
>
>  [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>"%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in 
>"printf.c"
>
>  [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in 
>"hwloc_base_util.c"
>
>  [6] rte_init(), line 205 in "ess_hnp_module.c"
>
>  [7] orte_init(pargc = 0x761c, pargv = 0x7610, flags 
>= 4U), line 148 in "orte_init.c"
>
>  [8] orterun(argc = 7, argv = 0x77a8), line 856 in "orterun.c"
>
>  [9] main(argc = 7, argv = 0x77a8), line 13 in "main.c"
>
>
>On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain  wrote:
>
>No, that looks different - it’s failing in mpirun itself. Can you get a line 
>number on it?
>
>
>Sorry for delay - I’m generating rc3 now
>
>
>
>On Dec 11, 2014, at 6:59 PM, Paul Hargrove  wrote:
>
>
>Don't see an rc3 yet.
>
>
>My Solaris-10/SPARC runs fail slightly differently (see below).
>
>It looks sufficiently similar that it MIGHT be the same root cause.
>
>However, lacking an rc3 to test I figured it would be better to report this 
>than to ignore it.
>
>
>The problem is present with both V8+ and V9 ABIs, and with both Gnu and Sun 
>compilers.
>
>
>-Paul
>
>
>[niagara1:29881] *** Process received signal ***
>
>[niagara1:29881] Signal: Segmentation Fault (11)
>
>[niagara1:29881] Signal code: Address not mapped (1)
>
>[niagara1:29881] Failing at address: 2
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
>
>/lib/libc.so.1:0xc5364
>
>/lib/libc.so.1:0xb9e64
>
>/lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_asprintf+0x30
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_hwloc_base_get_topo_signature+0x24c
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/openmpi/mca_ess_hnp.so:0x2d90
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris1

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Ralph Castain

Aha! You are the first to fall thru the timeout. How interesting.

Can you please try adding “-mca oob_tcp_connect_timeout 5:0”?

On Dec 12, 2014, at 8:53 AM, Paul Hargrove  wrote:
> 
> 
> First, I want to ask what became of the issue discussed in this thread?
>http://www.open-mpi.org/community/lists/devel/2014/11/16160.php 
> 
> I thought we had concluded that one just needed -D_REENTRANT.
> I mention that only for completeness, because I think my current problem is 
> different.
> 
> The following works fine with 1.8.3, making the current behavior a regression.
> 
> I am still on the same system as that previous report, and still/again see a 
> message like the following:
> 
> 
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:    pcp-j-19
>   Remote host:   172.18.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> 
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> [...etc...]
> 
> It may be worth noting that the hostname pcp-j-19 (172.16.0.119) and the 
> address 172.18.0.120 are on different subnets.
> 
> I CANNOT resolve the issue this time by adding -D_REENTRANT to CFLAGS at 
> configure time (I didn't bother to check whether it is there by default now or not).
> 
> NOR can I resolve it by using "-mca oob_tcp_if_include bge0" to allow only 
> the 172.16.0.120 subnet.
> IN FACT, the message is the same with that option, other than "172.18" 
> changing to "172.16".
> 
> I've attached the output generated by "-mca oob_base_verbose 20" both with 
> and without the oob_tcp_if_include.
> 
> I should also note that the following is my full mpirun command, which 
> excludes the tcp BTL.
> pcp-j-20$ mpirun -mca oob_tcp_if_include bge0 -mca oob_base_verbose 20 -mca 
> btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c
> 
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov 
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16551.php

Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain

Afraid I’m drawing a blank, Paul - I can’t see how we got to a bad address down 
there. This is at the beginning of orte_init, so there are no threads running 
nor has anything much happened.

Do you have any suggestions?


> On Dec 12, 2014, at 9:02 AM, Paul Hargrove  wrote:
> 
> Ralph,
> 
> The "arch" variable looks fine:
> Current function is opal_hwloc_base_get_topo_signature
>  2134       nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
> (dbx) print arch
> arch = 0x1001700a0 "sun4v"
> 
> And so is "fmt":
> 
> Current function is opal_asprintf
>   194   length = opal_vasprintf(ptr, fmt, ap);
> (dbx) print fmt
> fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
> 
> However, things have gone bad in guess_strlen():
> 
> Current function is guess_strlen
>    71   len += (int)strlen(sarg);
> (dbx) print sarg
> sarg = 0x2 ""
> 
> -Paul
> 
> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain wrote:
> Hmmm….this is really odd. I actually do have a protection for that arch value 
> being NULL, and you are in the code section when it isn’t.
> 
> Do you still have the core file around? If so, can you print out the value of 
> the “arch” variable? It would be in the opal_hwloc_base_get_topo_signature 
> level.
> 
> I’m wondering if that value has been hosed, and the problem is memory 
> corruption somewhere.
> 
> 
>> On Dec 11, 2014, at 8:56 PM, Ralph Castain wrote:
>> 
>> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t 
>> returning an architecture type for some reason, and I didn’t protect against 
>> it.
>> 
>> 
>>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove wrote:
>>> 
>>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>>> I've changed the subject line to distinguish this from the earlier report.
>>> 
>>> -Paul
>>> 
>>> program terminated by signal SEGV (no mapping at the fault address)
>>> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
>>> Current function is guess_strlen
>>>    71   len += (int)strlen(sarg);
>>> (dbx) where
>>>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at 
>>> 0x7d93b634 
>>> =>[2] guess_strlen(fmt = 0x7eeada98 
>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in 
>>> "printf.c"
>>>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in 
>>> "printf.c"
>>>   [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in 
>>> "printf.c"
>>>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in 
>>> "hwloc_base_util.c"
>>>   [6] rte_init(), line 205 in "ess_hnp_module.c"
>>>   [7] orte_init(pargc = 0x761c, pargv = 0x7610, 
>>> flags = 4U), line 148 in "orte_init.c"
>>>   [8] orterun(argc = 7, argv = 0x77a8), line 856 in "orterun.c"
>>>   [9] main(argc = 7, argv = 0x77a8), line 13 in "main.c"
>>> 
>>> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain wrote:
>>> No, that looks different - it’s failing in mpirun itself. Can you get a 
>>> line number on it?
>>> 
>>> Sorry for delay - I’m generating rc3 now
>>> 
>>> 
>>>> On Dec 11, 2014, at 6:59 PM, Paul Hargrove wrote:
>>>>
>>>> Don't see an rc3 yet.
>>>>
>>>> My Solaris-10/SPARC runs fail slightly differently (see below).
>>>> It looks sufficiently similar that it MIGHT be the same root cause.
>>>> However, lacking an rc3 to test I figured it would be better to report 
>>>> this than to ignore it.
>>>>
>>>> The problem is present with both V8+ and V9 ABIs, and with both Gnu and 
>>>> Sun compilers.
>>>>
>>>> -Paul
>>>>
>>>> [niagara1:29881] *** Process received signal ***
>>>> [niagara1:29881] Signal: Segmentation Fault (11)
>>>> [niagara1:29881] Signal code: Address not mapped (1)
>>>> [niagara1:29881] Failing at address: 2
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
>>>> /lib/libc.so.1:0xc5364
>>>> /lib/libc.so.1:0xb9e64
>>>> /lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_asprintf+0x30
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_hwloc_base_get_topo_signature+0x24c
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/openmpi/mca_ess

Re: [OMPI devel] Trunk warnings

2014-12-12 Thread Edgar Gabriel


I'll take care of the one ompio warning.
Edgar

On 12/12/2014 12:01 PM, Nathan Hjelm wrote:


The osc warnings will go away after the btl modifications are applied. I
made significant changes to the component.

-Nathan

On Fri, Dec 12, 2014 at 09:49:47AM -0800, Ralph Castain wrote:

While building optimized on Linux:
bcol_ptpcoll_allreduce.c: In function
'bcol_ptpcoll_allreduce_narraying_init':
bcol_ptpcoll_allreduce.c:236: warning: unused variable 'dtype'
bcol_ptpcoll_allreduce.c:235: warning: unused variable 'count'
io_ompio_file_set_view.c: In function
'mca_io_ompio_finalize_initial_grouping':
io_ompio_file_set_view.c:363: warning: 'sendreq' may be used uninitialized
in this function
osc_rdma_comm.c: In function 'ompi_osc_rdma_rget_accumulate_internal':
osc_rdma_comm.c:1034: warning: 'ptr' may be used uninitialized in this
function
osc_rdma_comm.c:1031: warning: 'frag' may be used uninitialized in this
function
osc_rdma_data_move.c: In function 'ompi_osc_rdma_callback':
osc_rdma_data_move.c:1647: warning: unused variable 'incoming_length'
osc_rdma_data_move.c: In function 'ompi_osc_rdma_control_send':
osc_rdma_data_move.c:225: warning: 'ptr' may be used uninitialized in this
function
osc_rdma_data_move.c:224: warning: 'frag' may be used uninitialized in
this function
osc_rdma_comm.c: In function 'ompi_osc_rdma_rget':
osc_rdma_comm.c:813: warning: 'ptr' may be used uninitialized in this
function
osc_rdma_comm.c:810: warning: 'frag' may be used uninitialized in this
function
osc_rdma_data_move.c: In function 'ompi_osc_gacc_long_start':
osc_rdma_data_move.c:973: warning: 'acc_data' may be used uninitialized in
this function
osc_rdma_comm.c: In function 'ompi_osc_rdma_put_w_req':
osc_rdma_comm.c:296: warning: 'ptr' may be used uninitialized in this
function
osc_rdma_comm.c:289: warning: 'frag' may be used uninitialized in this
function
osc_rdma_data_move.c: In function 'ompi_osc_rdma_gacc_start':
osc_rdma_data_move.c:924: warning: 'acc_data' may be used uninitialized in
this function
osc_rdma_data_move.c: In function 'ompi_osc_rdma_acc_long_start':
osc_rdma_data_move.c:839: warning: 'acc_data' may be used uninitialized in
this function
osc_rdma_comm.c: In function 'ompi_osc_rdma_accumulate_w_req':
osc_rdma_comm.c:479: warning: 'ptr' may be used uninitialized in this
function
osc_rdma_comm.c:476: warning: 'frag' may be used uninitialized in this
function
osc_rdma_comm.c: In function 'ompi_osc_rdma_get':
osc_rdma_comm.c:813: warning: 'ptr' may be used uninitialized in this
function
osc_rdma_comm.c:810: warning: 'frag' may be used uninitialized in this
function
vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in
this function
vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in
this function
vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in
this function
vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in
this function






--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335

Re: [OMPI devel] OpenIB has some borked code

2014-12-12 Thread Nathan Hjelm


It already is, as the commit I linked shows. I've been thinking about
trying to bring it and a handful of other fixes to master before the
rest of the commits.

-Nathan

On Fri, Dec 12, 2014 at 11:08:46AM -0700, Howard Pritchard wrote:
>Nathan,
>Please make sure the fix for this problem is contained in its own commit.
>Howard
>2014-12-12 9:38 GMT-07:00 Nathan Hjelm :
> 
>  Yeah, that code is completely wrong. I have a fix in my btl
>  modifications branch.
> 
>  
> https://github.com/hjelmn/ompi/commit/38e961193074d382983d000e68adb721aaf3df7d
> 
>  -Nathan
> 
>  On Fri, Dec 12, 2014 at 08:26:34AM -0800, Ralph Castain wrote:
>  >Hey folks
>  >I've been looking into this warning:
>  >btl_openib_component.c: In function 'init_one_device':
>  >btl_openib_component.c:2019:54: warning: comparison between 'enum
>  >' and 'mca_base_var_source_t' [-Wenum-compare]
>  > else if (BTL_OPENIB_RQ_SOURCE_DEVICE_INI ==
>  >  ^
>  >This warning is really valid - the equality can *never* be true.
>  >Essentially, someone defined two enum types, and is now trying to
>  check if
>  >one is equal to the other. This is the code block under concern:
>  >else if (BTL_OPENIB_RQ_SOURCE_DEVICE_INI ==
>  >mca_btl_openib_component.receive_queues_source) {
>  >opal_show_help("help-mpi-btl-openib.txt",
>  >   "locally conflicting
>  receive_queues", true,
>  >   opal_install_dirs.opaldatadir,
>  >   opal_process_info.nodename,
>  >
>  > ibv_get_device_name(receive_queues_device->ib_dev),
>  >
>  > receive_queues_device->ib_dev_attr.vendor_id,
>  >
>  > receive_queues_device->ib_dev_attr.vendor_part_id,
>  > 
>   mca_btl_openib_component.receive_queues,
>  >   ibv_get_device_name(device->ib_dev),
>  >   device->ib_dev_attr.vendor_id,
>  >   device->ib_dev_attr.vendor_part_id,
>  > 
>   mca_btl_openib_component.default_recv_qps);
>  >ret = OPAL_ERR_RESOURCE_BUSY;
>  >goto error;
>  >}
>  >BTL_OPENIB_RQ_SOURCE_DEVICE_INI is defined as an enum in the openib
>  code.
>  >The receive_queues_source field is an MCA base enum that indicates
>  the
>  >source of the param. In this case, it is indicating that the source
>  was a
>  >file, but says nothing about which file.
>  >I don't want to step on toes to fix this, but the code clearly is
>  wrong.
>  >Can someone please fix it? It's in the master as well as in the 1.8
>  branch
>  >Thanks
>  >Ralph
> 
>  > ___
>  > devel mailing list
>  > de...@open-mpi.org
>  > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>  > Link to this post:
>  http://www.open-mpi.org/community/lists/devel/2014/12/16546.php
> 
>  ___
>  devel mailing list
>  de...@open-mpi.org
>  Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>  Link to this post:
>  http://www.open-mpi.org/community/lists/devel/2014/12/16550.php





Re: [OMPI devel] 1.8.4rc2 now available for testing

2014-12-12 Thread Ralph Castain

I just checked it with --enable-memchecker --with-valgrind and found that many of 
these are legitimate leaks. We can take a look at them, though as I said, 
they may have to wait for 1.8.5, as I wouldn’t hold up 1.8.4 for them.

> On Dec 12, 2014, at 9:26 AM, Eric Chamberland 
>  wrote:
> 
> On 12/12/2014 11:38 AM, Jeff Squyres (jsquyres) wrote:
>> Did you configure OMPI with --enable-memchecker?
> 
> No, only "--prefix="
> 
> Eric
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16553.php

Re: [OMPI devel] OpenIB has some borked code

2014-12-12 Thread Howard Pritchard

Nathan,

Please make sure the fix for this problem is contained in its own commit.

Howard


2014-12-12 9:38 GMT-07:00 Nathan Hjelm :
>
>
> Yeah, that code is completely wrong. I have a fix in my btl
> modifications branch.
>
>
> https://github.com/hjelmn/ompi/commit/38e961193074d382983d000e68adb721aaf3df7d
>
> -Nathan
>
> On Fri, Dec 12, 2014 at 08:26:34AM -0800, Ralph Castain wrote:
> >Hey folks
> >I've been looking into this warning:
> >btl_openib_component.c: In function 'init_one_device':
> >btl_openib_component.c:2019:54: warning: comparison between 'enum
> >' and 'mca_base_var_source_t' [-Wenum-compare]
> > else if (BTL_OPENIB_RQ_SOURCE_DEVICE_INI ==
> >  ^
> >This warning is really valid - the equality can *never* be true.
> >Essentially, someone defined two enum types, and is now trying to
> check if
> >one is equal to the other. This is the code block under concern:
> >else if (BTL_OPENIB_RQ_SOURCE_DEVICE_INI ==
> >mca_btl_openib_component.receive_queues_source) {
> >opal_show_help("help-mpi-btl-openib.txt",
> >   "locally conflicting receive_queues",
> true,
> >   opal_install_dirs.opaldatadir,
> >   opal_process_info.nodename,
> >
> > ibv_get_device_name(receive_queues_device->ib_dev),
> >
> > receive_queues_device->ib_dev_attr.vendor_id,
> >
> > receive_queues_device->ib_dev_attr.vendor_part_id,
> >
>  mca_btl_openib_component.receive_queues,
> >   ibv_get_device_name(device->ib_dev),
> >   device->ib_dev_attr.vendor_id,
> >   device->ib_dev_attr.vendor_part_id,
> >
>  mca_btl_openib_component.default_recv_qps);
> >ret = OPAL_ERR_RESOURCE_BUSY;
> >goto error;
> >}
> >BTL_OPENIB_RQ_SOURCE_DEVICE_INI is defined as an enum in the openib
> code.
> >The receive_queues_source field is an MCA base enum that indicates the
> >source of the param. In this case, it is indicating that the source
> was a
> >file, but says nothing about which file.
> >I don't want to step on toes to fix this, but the code clearly is
> wrong.
> >Can someone please fix it? It's in the master as well as in the 1.8
> branch
> >Thanks
> >Ralph
>
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16546.php
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16550.php
>
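
The quoted warning comes from comparing a constant of one enum type against a field of another. A minimal sketch of the problem and of one possible way around it follows. All type, field, and function names here are hypothetical stand-ins, not the openib BTL's definitions, and this is not necessarily what the linked commit does. The idea is simply to keep the component's own finer-grained "source" in a field of the component's own enum type and compare like with like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a framework-level enum that records where a
 * parameter value came from. It can say "a file", but not which file. */
typedef enum {
    VAR_SOURCE_DEFAULT,
    VAR_SOURCE_COMMAND_LINE,
    VAR_SOURCE_ENV,
    VAR_SOURCE_FILE
} var_source_t;

/* Hypothetical stand-in for a component-level enum with finer detail. */
typedef enum {
    RQ_SOURCE_DEFAULT,
    RQ_SOURCE_MCA,
    RQ_SOURCE_DEVICE_INI
} rq_source_t;

struct component {
    var_source_t receive_queues_source; /* filled in by the framework */
    rq_source_t  rq_source;             /* filled in by the component itself */
};

/* Writing (RQ_SOURCE_DEVICE_INI == c->receive_queues_source) would mix the
 * two enum types and draw -Wenum-compare from gcc. Comparing against the
 * field of the matching type keeps the test meaningful. */
bool rq_came_from_device_ini(const struct component *c)
{
    assert(c != NULL);
    return RQ_SOURCE_DEVICE_INI == c->rq_source;
}
```

With this layout the framework field still answers "was it set from a file at all", while the component field answers "was it the device INI file".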

Re: [OMPI devel] Trunk warnings

2014-12-12 Thread Nathan Hjelm


The osc warnings will go away after the btl modifications are applied. I
made significant changes to the component.

-Nathan

On Fri, Dec 12, 2014 at 09:49:47AM -0800, Ralph Castain wrote:
>While building optimized on Linux:
>bcol_ptpcoll_allreduce.c: In function
>'bcol_ptpcoll_allreduce_narraying_init':
>bcol_ptpcoll_allreduce.c:236: warning: unused variable 'dtype'
>bcol_ptpcoll_allreduce.c:235: warning: unused variable 'count'
>io_ompio_file_set_view.c: In function
>'mca_io_ompio_finalize_initial_grouping':
>io_ompio_file_set_view.c:363: warning: 'sendreq' may be used uninitialized
>in this function
>osc_rdma_comm.c: In function 'ompi_osc_rdma_rget_accumulate_internal':
>osc_rdma_comm.c:1034: warning: 'ptr' may be used uninitialized in this
>function
>osc_rdma_comm.c:1031: warning: 'frag' may be used uninitialized in this
>function
>osc_rdma_data_move.c: In function 'ompi_osc_rdma_callback':
>osc_rdma_data_move.c:1647: warning: unused variable 'incoming_length'
>osc_rdma_data_move.c: In function 'ompi_osc_rdma_control_send':
>osc_rdma_data_move.c:225: warning: 'ptr' may be used uninitialized in this
>function
>osc_rdma_data_move.c:224: warning: 'frag' may be used uninitialized in
>this function
>osc_rdma_comm.c: In function 'ompi_osc_rdma_rget':
>osc_rdma_comm.c:813: warning: 'ptr' may be used uninitialized in this
>function
>osc_rdma_comm.c:810: warning: 'frag' may be used uninitialized in this
>function
>osc_rdma_data_move.c: In function 'ompi_osc_gacc_long_start':
>osc_rdma_data_move.c:973: warning: 'acc_data' may be used uninitialized in
>this function
>osc_rdma_comm.c: In function 'ompi_osc_rdma_put_w_req':
>osc_rdma_comm.c:296: warning: 'ptr' may be used uninitialized in this
>function
>osc_rdma_comm.c:289: warning: 'frag' may be used uninitialized in this
>function
>osc_rdma_data_move.c: In function 'ompi_osc_rdma_gacc_start':
>osc_rdma_data_move.c:924: warning: 'acc_data' may be used uninitialized in
>this function
>osc_rdma_data_move.c: In function 'ompi_osc_rdma_acc_long_start':
>osc_rdma_data_move.c:839: warning: 'acc_data' may be used uninitialized in
>this function
>osc_rdma_comm.c: In function 'ompi_osc_rdma_accumulate_w_req':
>osc_rdma_comm.c:479: warning: 'ptr' may be used uninitialized in this
>function
>osc_rdma_comm.c:476: warning: 'frag' may be used uninitialized in this
>function
>osc_rdma_comm.c: In function 'ompi_osc_rdma_get':
>osc_rdma_comm.c:813: warning: 'ptr' may be used uninitialized in this
>function
>osc_rdma_comm.c:810: warning: 'frag' may be used uninitialized in this
>function
>vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
>vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in
>this function
>vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
>vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in
>this function
>vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
>vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in
>this function
>vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
>vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in
>this function





[OMPI devel] Trunk warnings

2014-12-12 Thread Ralph Castain

While building optimized on Linux:

bcol_ptpcoll_allreduce.c: In function 'bcol_ptpcoll_allreduce_narraying_init':
bcol_ptpcoll_allreduce.c:236: warning: unused variable 'dtype'
bcol_ptpcoll_allreduce.c:235: warning: unused variable 'count'

io_ompio_file_set_view.c: In function 'mca_io_ompio_finalize_initial_grouping':
io_ompio_file_set_view.c:363: warning: 'sendreq' may be used uninitialized in 
this function

osc_rdma_comm.c: In function 'ompi_osc_rdma_rget_accumulate_internal':
osc_rdma_comm.c:1034: warning: 'ptr' may be used uninitialized in this function
osc_rdma_comm.c:1031: warning: 'frag' may be used uninitialized in this function
osc_rdma_data_move.c: In function 'ompi_osc_rdma_callback':
osc_rdma_data_move.c:1647: warning: unused variable 'incoming_length'
osc_rdma_data_move.c: In function 'ompi_osc_rdma_control_send':
osc_rdma_data_move.c:225: warning: 'ptr' may be used uninitialized in this 
function
osc_rdma_data_move.c:224: warning: 'frag' may be used uninitialized in this 
function
osc_rdma_comm.c: In function 'ompi_osc_rdma_rget':
osc_rdma_comm.c:813: warning: 'ptr' may be used uninitialized in this function
osc_rdma_comm.c:810: warning: 'frag' may be used uninitialized in this function
osc_rdma_data_move.c: In function 'ompi_osc_gacc_long_start':
osc_rdma_data_move.c:973: warning: 'acc_data' may be used uninitialized in this 
function
osc_rdma_comm.c: In function 'ompi_osc_rdma_put_w_req':
osc_rdma_comm.c:296: warning: 'ptr' may be used uninitialized in this function
osc_rdma_comm.c:289: warning: 'frag' may be used uninitialized in this function
osc_rdma_data_move.c: In function 'ompi_osc_rdma_gacc_start':
osc_rdma_data_move.c:924: warning: 'acc_data' may be used uninitialized in this 
function
osc_rdma_data_move.c: In function 'ompi_osc_rdma_acc_long_start':
osc_rdma_data_move.c:839: warning: 'acc_data' may be used uninitialized in this 
function
osc_rdma_comm.c: In function 'ompi_osc_rdma_accumulate_w_req':
osc_rdma_comm.c:479: warning: 'ptr' may be used uninitialized in this function
osc_rdma_comm.c:476: warning: 'frag' may be used uninitialized in this function
osc_rdma_comm.c: In function 'ompi_osc_rdma_get':
osc_rdma_comm.c:813: warning: 'ptr' may be used uninitialized in this function
osc_rdma_comm.c:810: warning: 'frag' may be used uninitialized in this function


vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in this 
function
vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in this 
function
vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in this 
function
vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in this 
function
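
Many of the "may be used uninitialized" warnings above name a 'ptr' or 'frag' variable. A common shape for that warning, assumed here and not confirmed against the osc/rdma sources, is a local that is only assigned through an out-parameter on the success path of a helper. The sketch below uses hypothetical names (frag_alloc, send_control) and shows the usual remedy: initialise at the declaration and return early on failure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical allocator: hands out the start of a pool if the request
 * fits. On failure it leaves *ptr untouched, which is the situation that
 * makes a caller's uninitialised local suspicious to the compiler. */
int frag_alloc(char *pool, size_t pool_len, size_t want, char **ptr)
{
    assert(pool != NULL && ptr != NULL);
    if (want > pool_len) {
        return -1;
    }
    *ptr = pool;
    return 0;
}

/* Hypothetical caller: copies a message into an allocated fragment. */
int send_control(char *pool, size_t pool_len, const char *msg, size_t len)
{
    char *ptr = NULL;   /* defined value on every path */
    int ret = frag_alloc(pool, pool_len, len, &ptr);

    if (0 != ret) {
        return ret;     /* never touch ptr after a failed allocation */
    }
    memcpy(ptr, msg, len);
    return 0;
}
```

Initialising to NULL does not fix a real logic error, but it turns a read of garbage into a deterministic failure and lets the compiler see that every path assigns the variable.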

Re: [OMPI devel] 1.8.4rc2 now available for testing

2014-12-12 Thread Eric Chamberland


On 12/12/2014 11:38 AM, Jeff Squyres (jsquyres) wrote:

Did you configure OMPI with --enable-memchecker?


No, only "--prefix="

Eric

Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Paul Hargrove

Ralph,

The "arch" variable looks fine:
Current function is opal_hwloc_base_get_topo_signature
 2134       nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
(dbx) print arch
arch = 0x1001700a0 "sun4v"

And so is "fmt":

Current function is opal_asprintf
  194   length = opal_vasprintf(ptr, fmt, ap);
(dbx) print fmt
fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"

However, things have gone bad in guess_strlen():

Current function is guess_strlen
   71   len += (int)strlen(sarg);
(dbx) print sarg
sarg = 0x2 ""
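
One reading consistent with sarg = 0x2 is that the length-guessing pass and the argument list fell out of step before "%s" was reached, so a small integer was fetched where the string pointer was expected. The status summary at the top of this archive reports that fixing guess_strlen(), or changing "%u" to "%d", avoided the crash. The following is a minimal sketch of a length estimator that consumes one argument for every conversion it walks past, "%u" included. It is not the Open MPI guess_strlen(): the function names are hypothetical, only %s, %d, %u and %% are handled, and a 32-bit int is assumed.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: upper bound on the formatted length, including
 * the terminating NUL, for a format using only %s, %d, %u and %%. */
size_t sketch_guess_len_v(const char *fmt, va_list ap)
{
    assert(fmt != NULL);
    size_t len = strlen(fmt) + 1;   /* literal text plus terminating NUL */
    const char *p = fmt;

    while (*p != '\0') {
        if (*p++ != '%') {
            continue;
        }
        switch (*p) {
        case 's': {
            const char *sarg = va_arg(ap, const char *);
            len += (sarg != NULL) ? strlen(sarg) : strlen("(null)");
            break;
        }
        case 'd':
            (void) va_arg(ap, int);           /* consume the argument */
            len += 11;                        /* sign plus 10 digits */
            break;
        case 'u':
            (void) va_arg(ap, unsigned int);  /* skipping this desyncs %s */
            len += 11;
            break;
        case '\0':
            return len;                       /* stray '%' at the end */
        default:
            break;                            /* e.g. "%%": no argument */
        }
        ++p;
    }
    return len;
}

/* Variadic front end for the helper above. */
size_t sketch_guess_len(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    size_t len = sketch_guess_len_v(fmt, ap);
    va_end(ap);
    return len;
}
```

Called with the format from the backtrace, for example sketch_guess_len("%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", 1u, 1u, 0u, 8u, 8u, 8u, 64u, "sun4v") with made-up counts, the "%s" case receives the real string pointer because each "%u" has already consumed its own argument.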

-Paul

On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain  wrote:

> Hmmm... this is really odd. I actually do have a protection for that arch
> value being NULL, and you are in the code section when it isn't.
>
> Do you still have the core file around? If so, can you print out the value
> of the "arch" variable? It would be in the
> opal_hwloc_base_get_topo_signature level.
>
> I'm wondering if that value has been hosed, and the problem is memory
> corruption somewhere.
>
>
> On Dec 11, 2014, at 8:56 PM, Ralph Castain  wrote:
>
> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn't
> returning an architecture type for some reason, and I didn't protect
> against it.
>
>
> On Dec 11, 2014, at 7:39 PM, Paul Hargrove  wrote:
>
> Backtrace for the Solaris-10/SPARC SEGV appears below.
> I've changed the subject line to distinguish this from the earlier report.
>
> -Paul
>
> program terminated by signal SEGV (no mapping at the fault address)
> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
> Current function is guess_strlen
>    71   len += (int)strlen(sarg);
> (dbx) where
>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at
> 0x7d93b634
> =>[2] guess_strlen(fmt = 0x7eeada98
> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in
> "printf.c"
>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98
> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in
> "printf.c"
>   [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98
> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in
> "printf.c"
>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in
> "hwloc_base_util.c"
>   [6] rte_init(), line 205 in "ess_hnp_module.c"
>   [7] orte_init(pargc = 0x761c, pargv = 0x7610,
> flags = 4U), line 148 in "orte_init.c"
>   [8] orterun(argc = 7, argv = 0x77a8), line 856 in "orterun.c"
>   [9] main(argc = 7, argv = 0x77a8), line 13 in "main.c"
>
> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain  wrote:
>
>> No, that looks different - it's failing in mpirun itself. Can you get a
>> line number on it?
>>
>> Sorry for delay - I'm generating rc3 now
>>
>>
>> On Dec 11, 2014, at 6:59 PM, Paul Hargrove  wrote:
>>
>> Don't see an rc3 yet.
>>
>> My Solaris-10/SPARC runs fail slightly differently (see below).
>> It looks sufficiently similar that it MIGHT be the same root cause.
>> However, lacking an rc3 to test I figured it would be better to report
>> this than to ignore it.
>>
>> The problem is present with both V8+ and V9 ABIs, and with both Gnu and
>> Sun compilers.
>>
>> -Paul
>>
>> [niagara1:29881] *** Process received signal ***
>> [niagara1:29881] Signal: Segmentation Fault (11)
>> [niagara1:29881] Signal code: Address not mapped (1)
>> [niagara1:29881] Failing at address: 2
>>
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
>> /lib/libc.so.1:0xc5364
>> /lib/libc.so.1:0xb9e64
>> /lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_asprintf+0x30
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_hwloc_base_get_topo_signature+0x24c
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/openmpi/mca_ess_hnp.so:0x2d90
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-rte.so.7.0.5:orte_init+0x2f8
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:orterun+0xaa8
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:main+0x14
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:_start+0x5c
>> [niagara1:29881] *** End of error message ***
>> Segmentation Fault - core dumped
>>
>> On Thu, Dec 11, 2014 at 3:29 PM, Ralph Castain  wrote:

[OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Paul Hargrove

First, I want to ask what became of the issue discussed in this thread?
   http://www.open-mpi.org/community/lists/devel/2014/11/16160.php
I thought we had concluded that one just needed -D_REENTRANT.
I mention that only for completeness, because I think my current problem is
different.

The following works fine with 1.8.3, making the current behavior a
regression.

I am still on the same system as that previous report, and still/again see
a message like the following:


A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    pcp-j-19
  Remote host:   172.18.0.120
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.

--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
[...etc...]

It may be worth noting that the hostname pcp-j-19 (172.16.0.119) and the
address 172.18.0.120 are on different subnets.

I CANNOT resolve the issue this time by adding -D_REENTRANT to CFLAGS at
configure time (I didn't bother to check whether it is there by default now or not).

NOR can I resolve it by using "-mca oob_tcp_if_include bge0" to allow only
the 172.16.0.120 subnet.
IN FACT, the message is the same with that option, other than "172.18"
changing to "172.16".

I've attached the output generated by "-mca oob_base_verbose 20" both with
and without the oob_tcp_if_include.

I should also note that the following is my full mpirun command, which
excludes the tcp BTL.
pcp-j-20$ mpirun -mca oob_tcp_if_include bge0 -mca oob_base_verbose 20 -mca
btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c


-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
[pcp-j-20:29156] mca: base: components_register: registering oob components
[pcp-j-20:29156] mca: base: components_register: found loaded component tcp
[pcp-j-20:29156] mca: base: components_register: component tcp register 
function successful
[pcp-j-20:29156] mca: base: components_open: opening oob components
[pcp-j-20:29156] mca: base: components_open: found loaded component tcp
[pcp-j-20:29156] mca: base: components_open: component tcp open function 
successful
[pcp-j-20:29156] mca:oob:select: checking available component tcp
[pcp-j-20:29156] mca:oob:select: Querying component [tcp]
[pcp-j-20:29156] oob:tcp: component_available called
[pcp-j-20:29156] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[pcp-j-20:29156] [[4268,0],0] oob:tcp:init rejecting interface lo0 (not in 
include list)
[pcp-j-20:29156] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[pcp-j-20:29156] [[4268,0],0] oob:tcp:init adding 172.16.0.120 to our list of 
V4 connections
[pcp-j-20:29156] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[pcp-j-20:29156] [[4268,0],0] oob:tcp:init rejecting interface p.ibp0 (not 
in include list)
[pcp-j-20:29156] [[4268,0],0] TCP STARTUP
[pcp-j-20:29156] [[4268,0],0] attempting to bind to IPv4 port 0
[pcp-j-20:29156] [[4268,0],0] assigned IPv4 port 33536
[pcp-j-20:29156] mca:oob:select: Adding component to end
[pcp-j-20:29156] mca:oob:select: Found 1 active transports
[pcp-j-19:26282] mca: base: components_register: registering oob components
[pcp-j-19:26282] mca: base: components_register: found loaded component tcp
[pcp-j-19:26282] mca: base: components_register: component tcp register 
function successful
[pcp-j-19:26282] mca: base: components_open: opening oob components
[pcp-j-19:26282] mca: base: components_open: found loaded component tcp
[pcp-j-19:26282] mca: base: components_open: component tcp open function 
successful
[pcp-j-19:26282] mca:oob:select: checking available component tcp
[pcp-j-19:26282] mca:oob:select: Querying component [tcp]
[pcp-j-19:26282] oob:tcp: component_available called
[pcp-j-19:26282] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[pcp-j-19:26282] [[4268,0],1] oob:tcp:init rejecting interface lo0 (not in 
include list)
[pcp-j-19:26282] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[pcp-j-19:26282] [[4268,0],1] oob:tcp:init adding 172.16.0.119 to our list of 
V4 connections
[pcp-j-19:26282] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[pcp-j-19:26282] [[4268,0],1] oob:tcp:init rejecting interface p.ibp0 (not 
in include list)
[pcp-j-19:26282] [[4268,0],1] TCP STARTUP
[pcp-j-19:26282] [[4268,0],1] attempting to bind to IPv4 port 0
[pcp-j-19:26282] [[4268,0],1] assigned IPv4 port 33429
[pcp-j-19:26282] mca:oob:select: Adding component to end
[pcp-j-19:26282] mca:oob:select: Found 1 active transports
[pcp-j-19:26282] [[4268,0],1]: set_addr to uri 
279707648.0;tcp://172.16.0.120:33536
[pcp-j-19:26282]

Re: [OMPI devel] OpenIB has some borked code

2014-12-12 Thread Nathan Hjelm


Yeah, that code is completely wrong. I have a fix in my btl
modifications branch.

https://github.com/hjelmn/ompi/commit/38e961193074d382983d000e68adb721aaf3df7d

-Nathan

On Fri, Dec 12, 2014 at 08:26:34AM -0800, Ralph Castain wrote:
>Hey folks
>I've been looking into this warning:
>btl_openib_component.c: In function 'init_one_device':
>btl_openib_component.c:2019:54: warning: comparison between 'enum
>' and 'mca_base_var_source_t' [-Wenum-compare]
> else if (BTL_OPENIB_RQ_SOURCE_DEVICE_INI ==
>  ^
>This warning is really valid - the equality can *never* be true.
>Essentially, someone defined two enum types, and is now trying to check if
>one is equal to the other. This is the code block under concern:
>else if (BTL_OPENIB_RQ_SOURCE_DEVICE_INI ==
>mca_btl_openib_component.receive_queues_source) {
>opal_show_help("help-mpi-btl-openib.txt",
>   "locally conflicting receive_queues", true,
>   opal_install_dirs.opaldatadir,
>   opal_process_info.nodename,
> 
> ibv_get_device_name(receive_queues_device->ib_dev),
> 
> receive_queues_device->ib_dev_attr.vendor_id,
> 
> receive_queues_device->ib_dev_attr.vendor_part_id,
>   mca_btl_openib_component.receive_queues,
>   ibv_get_device_name(device->ib_dev),
>   device->ib_dev_attr.vendor_id,
>   device->ib_dev_attr.vendor_part_id,
>   mca_btl_openib_component.default_recv_qps);
>ret = OPAL_ERR_RESOURCE_BUSY;
>goto error;
>}
>BTL_OPENIB_RQ_SOURCE_DEVICE_INI is defined as an enum in the openib code.
>The receive_queues_source field is an MCA base enum that indicates the
>source of the param. In this case, it is indicating that the source was a
>file, but says nothing about which file.
>I don't want to step on toes to fix this, but the code clearly is wrong.
>Can someone please fix it? It's in the master as well as in the 1.8 branch
>Thanks
>Ralph

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16546.php




Re: [OMPI devel] 1.8.4rc2 now available for testing

2014-12-12 Thread Jeff Squyres (jsquyres)

Did you configure OMPI with --enable-memchecker?


On Dec 12, 2014, at 8:35 AM, Ralph Castain  wrote:

> We have made more of an effort to get valgrind clean on the master - haven’t 
> brought all of it across due to the desire to minimize change in 1.8
> 
> I’ll see what can be done, probably more for 1.8.5 at this point. Most of 
> these look like legitimate leaks that should be addressed as opposed to 
> suppressed.
> 
> 
>> On Dec 12, 2014, at 8:00 AM, Eric Chamberland 
>>  wrote:
>> 
>> On 12/11/2014 05:45 AM, Ralph Castain wrote:
>> ...
>> 
>>> by the reporters. Still, I would appreciate a fairly thorough testing as
>>> this is expected to be the last 1.8 series release for some time.
>> 
>> Is it relevant to report valgrind leaks?  Maybe they are "normal" or not, I 
>> don't know.  If they are normal, maybe suppressions should be added to 
>> .../share/openmpi/openmpi-valgrind.supp before the release?
>> 
>> Here is a simple test case ;-) :
>> 
>> cat mpi_init_finalize.c
>> 
>> #include "mpi.h"
>> 
>> int main(int argc, char *argv[])
>> {
>>   MPI_Init(&argc, &argv);
>>   MPI_Finalize();
>>   return 0;
>> }
>> 
>> 
>> mpicc -o mpi_init_finalize mpi_init_finalize.c
>> 
>> mpiexec -np 1 valgrind -v 
>> --suppressions=/opt/openmpi-1.8.4rc2/share/openmpi/openmpi-valgrind.supp 
>> --gen-suppressions=all --leak-check=full --leak-resolution=high 
>> --show-reachable=yes --error-limit=no --num-callers=24 --track-fds=yes 
>> --log-file=valgrind_out.n%q{OMPI_COMM_WORLD_RANK} ./mpi_init_finalize
>> 
>> running with 2 processes generates some more:
>> 
>> mpiexec -np 2  --log-file=valgrind_out_2proc.n%q{OMPI_COMM_WORLD_RANK} 
>> ./mpi_init_finalize
>> 
>> which results in the files attached...
>> 
>> Thanks,
>> 
>> Eric
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16545.php
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16548.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

Re: [OMPI devel] 1.8.4rc2 now available for testing

2014-12-12 Thread Ralph Castain

We have made more of an effort to get valgrind clean on the master - haven’t 
brought all of it across due to the desire to minimize change in 1.8

I’ll see what can be done, probably more for 1.8.5 at this point. Most of these 
look like legitimate leaks that should be addressed as opposed to suppressed.


> On Dec 12, 2014, at 8:00 AM, Eric Chamberland 
>  wrote:
> 
> On 12/11/2014 05:45 AM, Ralph Castain wrote:
> ...
> 
>> by the reporters. Still, I would appreciate a fairly thorough testing as
>> this is expected to be the last 1.8 series release for some time.
> 
> Is it relevant to report valgrind leaks?  Maybe they are "normal" or not, I 
> don't know.  If they are normal, maybe suppressions should be added to 
> .../share/openmpi/openmpi-valgrind.supp before the release?
> 
> Here is a simple test case ;-) :
> 
> cat mpi_init_finalize.c
> 
> #include "mpi.h"
> 
> int main(int argc, char *argv[])
> {
>   MPI_Init(&argc, &argv);
>   MPI_Finalize();
>   return 0;
> }
> 
> 
> mpicc -o mpi_init_finalize mpi_init_finalize.c
> 
> mpiexec -np 1 valgrind -v 
> --suppressions=/opt/openmpi-1.8.4rc2/share/openmpi/openmpi-valgrind.supp 
> --gen-suppressions=all --leak-check=full --leak-resolution=high 
> --show-reachable=yes --error-limit=no --num-callers=24 --track-fds=yes 
> --log-file=valgrind_out.n%q{OMPI_COMM_WORLD_RANK} ./mpi_init_finalize
> 
> running with 2 processes generates some more:
> 
> mpiexec -np 2  --log-file=valgrind_out_2proc.n%q{OMPI_COMM_WORLD_RANK} 
> ./mpi_init_finalize
> 
> which results in the files attached...
> 
> Thanks,
> 
> Eric
> 

Re: [OMPI devel] [1.8.4rc3] dangling symlinks

2014-12-12 Thread Ralph Castain

Fixed in master, setup for 1.8.4 - thanks Paul!
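
For anyone who wants to check an existing install tree for the dangling links
described in the quoted report, a minimal sketch follows. It is not part of
Open MPI; the helper name is_dangling_symlink is hypothetical, and it simply
relies on lstat() describing the link itself while stat() follows it.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: true when `path` is a symbolic link whose target
 * does not exist. Anything that is not a symlink yields false. */
static bool is_dangling_symlink(const char *path)
{
    struct stat link_info;
    struct stat target_info;

    /* lstat() inspects the link itself rather than what it points to. */
    if (lstat(path, &link_info) != 0 || !S_ISLNK(link_info.st_mode)) {
        return false;
    }

    /* stat() follows the link; a missing target fails with ENOENT/ENOTDIR. */
    return stat(path, &target_info) != 0 &&
           (errno == ENOENT || errno == ENOTDIR);
}
```

Calling it on each entry of ${prefix}/bin would flag links such as the two
listed in the report when mpijavac was not installed.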

> On Dec 11, 2014, at 11:47 PM, Paul Hargrove  wrote:
> 
> On a Linux system configured without java support I see the following two 
> dangling symlinks installed in ${prefix}/bin:
> 
> lrwxrwxrwx  1 phhargrove phhargrove 8 Dec 11 23:52 oshjavac -> mpijavac
> lrwxrwxrwx  1 phhargrove phhargrove 8 Dec 11 23:52 shmemjavac -> mpijavac 
> 
> It seems there is some logic missing to make installation of those links 
> conditional on Java support.
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov 
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16536.php

[OMPI devel] OpenIB has some borked code

2014-12-12 Thread Ralph Castain

Hey folks

I’ve been looking into this warning:

btl_openib_component.c: In function 'init_one_device':
btl_openib_component.c:2019:54: warning: comparison between 'enum ' 
and 'mca_base_var_source_t' [-Wenum-compare]
 else if (BTL_OPENIB_RQ_SOURCE_DEVICE_INI ==
  ^


This warning is really valid - the equality can *never* be true. Essentially, 
someone defined two enum types, and is now trying to check if one is equal to 
the other. This is the code block under concern:

else if (BTL_OPENIB_RQ_SOURCE_DEVICE_INI ==
         mca_btl_openib_component.receive_queues_source) {
    opal_show_help("help-mpi-btl-openib.txt",
                   "locally conflicting receive_queues", true,
                   opal_install_dirs.opaldatadir,
                   opal_process_info.nodename,
                   ibv_get_device_name(receive_queues_device->ib_dev),
                   receive_queues_device->ib_dev_attr.vendor_id,
                   receive_queues_device->ib_dev_attr.vendor_part_id,
                   mca_btl_openib_component.receive_queues,
                   ibv_get_device_name(device->ib_dev),
                   device->ib_dev_attr.vendor_id,
                   device->ib_dev_attr.vendor_part_id,
                   mca_btl_openib_component.default_recv_qps);
    ret = OPAL_ERR_RESOURCE_BUSY;
    goto error;
}

BTL_OPENIB_RQ_SOURCE_DEVICE_INI is defined as an enum in the openib code. The 
receive_queues_source field is an MCA base enum that indicates the source of 
the param. In this case, it is indicating that the source was a file, but says 
nothing about which file.

I don’t want to step on toes to fix this, but the code clearly is wrong. Can 
someone please fix it? It’s in the master as well as in the 1.8 branch.

Thanks
Ralph
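
To illustrate the concern in the message above, here is a minimal sketch of
two unrelated enum types and one way to keep them apart. Every name in it
(var_source_t, rq_source_t, struct component_state, rq_came_from_device_ini)
is a hypothetical stand-in, not the openib or MCA definitions, and it is not
the fix that went into the tree.

```c
#include <stdbool.h>

/* Stand-in for a generic "where did this parameter come from" enum. */
typedef enum {
    VAR_SOURCE_DEFAULT,
    VAR_SOURCE_ENV,
    VAR_SOURCE_FILE
} var_source_t;

/* Stand-in for a component-private "which receive_queues won" enum. */
typedef enum {
    RQ_SOURCE_DEFAULT,
    RQ_SOURCE_MCA,
    RQ_SOURCE_DEVICE_INI
} rq_source_t;

/* Keeping one field per enum type means no comparison ever mixes them. */
struct component_state {
    var_source_t param_source;  /* generic origin of the parameter value */
    rq_source_t  rq_source;     /* component-specific origin of the queues */
};

/* Compare like with like: rq_source_t against rq_source_t. */
static bool rq_came_from_device_ini(const struct component_state *c)
{
    return RQ_SOURCE_DEVICE_INI == c->rq_source;
}
```

In this sketch VAR_SOURCE_FILE and RQ_SOURCE_DEVICE_INI happen to share the
same integer value, which shows why a mixed comparison tells you nothing: it
only compares the underlying numbers, not what they mean.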

Re: [OMPI devel] 1.8.4rc2 now available for testing

2014-12-12 Thread Eric Chamberland


On 12/11/2014 05:45 AM, Ralph Castain wrote:
...


by the reporters. Still, I would appreciate a fairly thorough testing as
this is expected to be the last 1.8 series release for some time.


Is it relevant to report valgrind leaks?  Maybe they are "normal" or 
not, I don't know.  If they are normal, maybe suppressions should be 
added to .../share/openmpi/openmpi-valgrind.supp before the release?


Here is a simple test case ;-) :

cat mpi_init_finalize.c

#include "mpi.h"

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  MPI_Finalize();
  return 0;
}


mpicc -o mpi_init_finalize mpi_init_finalize.c

mpiexec -np 1 valgrind -v 
--suppressions=/opt/openmpi-1.8.4rc2/share/openmpi/openmpi-valgrind.supp 
--gen-suppressions=all --leak-check=full --leak-resolution=high 
--show-reachable=yes --error-limit=no --num-callers=24 --track-fds=yes 
--log-file=valgrind_out.n%q{OMPI_COMM_WORLD_RANK} ./mpi_init_finalize


running with 2 processes generates some more:

mpiexec -np 2  
--log-file=valgrind_out_2proc.n%q{OMPI_COMM_WORLD_RANK} ./mpi_init_finalize


which results in the files attached...

Thanks,

Eric



valgrind_out.tgz
Description: application/compressed-tar

Re: [OMPI devel] [1.8.4rc2] build broken by default on SGI UV

2014-12-12 Thread Nathan Hjelm


Hmm, I thought we already cleaned that up in 1.8. I will take a look
today.

BTW, can you send me the sn/xpmem.h file from your machine? I might have
an idea what is going wrong. I can't seem to find the link to SGI's
tarball on their OSS site.

-Nathan

On Thu, Dec 11, 2014 at 06:53:00PM -0800, Paul Hargrove wrote:
>I think I've reported this earlier in the 1.8 series.
>If I compile on an SGI UV (e.g. blacklight at PSC) configure picks up the
>presence of xpmem headers and enables the vader BTL.
>However, the port of vader to SGI's "flavor" of xpmem is incomplete and
>the following build failure results:
>make[2]: Entering directory
>
> `/brashear/hargrove/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/BLD/ompi/mca/btl/vader'
>  CC   btl_vader_module.lo
>In file included from
>
> /usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader.h:60,
> from
>
> /usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader_module.c:29:
>
> /usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader_endpoint.h:76:
>error: expected specifier-qualifier-list before 'xpmem_apid_t'
>
> /usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader_module.c:
>In function 'init_vader_endpoint':
>
> /usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader_module.c:197:
>error: 'struct ' has no member named 'apid'
>
> /usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader_module.c:
>In function 'mca_btl_vader_endpoint_destructor':
>
> /usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader_module.c:682:
>error: 'struct ' has no member named 'apid'
>
> /usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader_module.c:683:
>error: 'struct ' has no member named 'apid'
>make[2]: *** [btl_vader_module.lo] Error 1
>make[2]: Leaving directory
>
> `/brashear/hargrove/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/BLD/ompi/mca/btl/vader'
>make[1]: *** [all-recursive] Error 1
>make[1]: Leaving directory
>`/brashear/hargrove/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/BLD/ompi'
>make: *** [all-recursive] Error 1
>This can trivially be fixed in configure if one doesn't recognize the SGI
>variant of xpmem.
>I think (untested) that the following is sufficient:
>--- ./ompi/mca/btl/vader/configure.m4~  2014-12-11 18:51:11.499654000
>-0800
>+++ ./ompi/mca/btl/vader/configure.m4   2014-12-11 18:51:52.289654000
>-0800
>@@ -23,7 +23,7 @@
> AC_ARG_WITH([xpmem],
> [AC_HELP_STRING([--with-xpmem(=DIR)],
> [Build with XPMEM kernel module support, searching for
>headers in DIR])])
>-OMPI_CHECK_WITHDIR([xpmem], [$with_xpmem], [include/xpmem.h
>include/sn/xpmem.h])
>+OMPI_CHECK_WITHDIR([xpmem], [$with_xpmem], [include/xpmem.h])
> 
> AC_ARG_WITH([xpmem-libdir],
> [AC_HELP_STRING([--with-xpmem-libdir=DIR],
>-Paul
>--
>Paul H. Hargrove  phhargr...@lbl.gov
>Computer Languages & Systems Software (CLaSS) Group
>Computer Science Department   Tel: +1-510-495-2352
>Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16520.php




Re: [OMPI devel] OMPI devel] [OMPI users] OpenMPI 1.8.4 and hwloc in Fedora 14 using a beta gcc 5.0 compiler.

2014-12-12 Thread Gilles Gouaillardet

Ralph,

I can do that starting from Monday.

Cheers,

Gilles

Ralph Castain  wrote:
>Thanks Brice!
>
>Our 1.8 branch probably has another 2 or so years in it, but I think we can 
>lock it down fairly soon. Since we’ve shaken a lot of the bugs out of 1.8, we 
>are now seeing the “adoption wave” that is causing bug reports. Once we get 
>thru this, I expect things will settle down again.
>
>I know Jeff is hosed, and I’m likewise next week. Can someone create a PR to 
>update 1.8 with these patches?
>
>
>> On Dec 12, 2014, at 12:32 AM, Brice Goglin  wrote:
>> 
>> Le 12/12/2014 07:36, Gilles Gouaillardet a écrit :
>>> Brice,
>>> 
>>> ompi master is based on hwloc 1.9.1, isn't it ?
>> 
>> Yes sorry, I am often confused by all these OMPI vs hwloc branch numbers.
>> 
>>> 
>>> if some backport is required for hwloc 1.7.2 (used by ompi v1.8), then
>>> could you please update the hwloc v1.7 branch ?
>> 
>> Done. I pushed 14 commits there. This branch lags significantly behind 
>> master and v1.10 so I don't think I'll be able to maintain it much longer.
>> 
>> Brice
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16538.php
>
>___
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/12/16540.php

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-12 Thread Pim Schellart

Dear All,

we have now recompiled both openmpi (1.8.3) and SLURM against an externally 
compiled and installed hwloc (1.10.0). With these changes the out-of-order 
topology discovery warning disappears. By now we also believe the problem was 
probably somewhere in SLURM rather than in openmpi but where exactly we don’t 
know. Thank you for your help in solving this!

Kind regards,

Pim Schellart

> On 11 Dec 2014, at 04:19, Gilles Gouaillardet  
> wrote:
> 
> Ralph,
> 
> You are right,
> please disregard my previous post, it was irrelevant.
> 
> i just noticed that unlike ompi v1.8 (hwloc 1.7.2 based => no warning), 
> master has this warning (hwloc 1.9.1)
> 
> i will build slurm vs a recent hwloc and see what happens
> (FWIW RHEL6 comes with hwloc 1.5, RHEL7 comes with hwloc 1.7 and both do 
> *not* have this warning)
> 
> Cheers,
> 
> Gilles
> 
> On 2014/12/11 12:02, Ralph Castain wrote:
>> Per his prior notes, he is using mpirun to launch his jobs. Brice has 
>> confirmed that OMPI doesn’t have that hwloc warning in it. So either he has 
>> inadvertently linked against the Ubuntu system version of hwloc, or the 
>> message must be coming from Slurm.
>> 
>> 
>> 
>>> On Dec 10, 2014, at 6:14 PM, Gilles Gouaillardet wrote:
>>> 
>>> Pim,
>>> 
>>> at this stage, all i can do is acknowledge your slurm is configured to use 
>>> cgroups.
>>> 
>>> and based on your previous comment (e.g. problem only occurs with several 
>>> jobs on the same node)
>>> that *could* be a bug in OpenMPI (or hwloc).
>>> 
>>> by the way, how do you start your mpi application ?
>>> - do you use mpirun ?
>>> - do you use srun --resv-ports ?
>>> 
>>> i'll try to reproduce this in my test environment.
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 2014/12/11 2:45, Pim Schellart wrote:
>>> 
>>>> Dear Gilles et al.,
>>>> 
>>>> we tested with openmpi compiled from source (version 1.8.3) both with:
>>>> 
>>>> ./configure --prefix=/usr/local/openmpi --disable-silent-rules 
>>>> --with-libltdl=external --with-devel-headers --with-slurm 
>>>> --enable-heterogeneous --disable-vt --sysconfdir=/etc/openmpi
>>>> 
>>>> and
>>>> 
>>>> ./configure --prefix=/usr/local/openmpi --with-hwloc=/usr 
>>>> --disable-silent-rules --with-libltdl=external --with-devel-headers 
>>>> --with-slurm --enable-heterogeneous --disable-vt --sysconfdir=/etc/openmpi
>>>> 
>>>> (i.e. with embedded and external hwloc) and the issue remains the same. 
>>>> Meanwhile we have found another interesting detail. A job is started 
>>>> consisting of four tasks split over two nodes. If this is the only job 
>>>> running on those nodes the out-of-order warnings do not appear. However, 
>>>> if multiple jobs are running the warnings do appear but only for the jobs 
>>>> that are started later. We suspect that this is because for the first 
>>>> started job the CPU cores assigned are 0 and 1 whereas they are different 
>>>> for the later started jobs. I attached the output (including lstopo --of 
>>>> xml output (called for each task)) for both the working and broken case 
>>>> again.
>>>> 
>>>> Kind regards,
>>>> 
>>>> Pim Schellart
>>>> 
>>>>> On 09 Dec 2014, at 09:38, Gilles Gouaillardet wrote:
>>>>> 
>>>>> Pim,
>>>>> 
>>>>> if you configure OpenMPI with --with-hwloc=external (or something like
>>>>> --with-hwloc=/usr) it is very likely
>>>>> OpenMPI will use the same hwloc library (e.g. the "system" library) that
>>>>> is used by SLURM
>>>>> 
>>>>> /* i do not know how Ubuntu packages OpenMPI ... */
>>>>> 
>>>>> 
>>>>> The default (e.g. no --with-hwloc parameter in the configure command
>>>>> line) is to use the hwloc library that is embedded within OpenMPI
>>>>> 
>>>>> Gilles
>>>>> 
>>>>> On 2014/12/09 17:34, Pim Schellart wrote:
>>>>> 
>>>>>> Ah, ok so that was where the confusion came from, I did see hwloc in the 
>>>>>> SLURM sources but couldn’t immediately figure out where exactly it was 
>>>>>> used. We will try compiling openmpi with the embedded hwloc. Any 
>>>>>> particular flags I should set?
>>>>>> 
>>>>>> 
>>>>>>> On 09 Dec 2014, at 09:30, Ralph Castain wrote:
>>>>>>> 
>>>>>>> There is no linkage between slurm and ompi when it comes to hwloc. If 
>>>>>>> you directly launch your app using srun, then slurm will use its 
>>>>>>> version of hwloc to do the binding. If you use mpirun to launch the 
>>>>>>> app, then we’ll use our internal version to do it.
>>>>>>> 
>>>>>>> The two are completely isolated from each other.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Dec 9, 2014, at 12:25 AM, Pim Schellart wrote:
>>>>>>>> 
>>>>>>>> The version that “lstopo --version” reports is the same (1.8) on all 
>>>>>>>> nodes, but we may indeed be hitting the second issue. We can try to 
>>>>>>>> compile a new version of openmpi, but how do we ensure that the 
>>>>>>>> external programs (e.g. SLURM) are using t

Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain

Hmmm….this is really odd. I actually do have a protection for that arch value 
being NULL, and you are in the code section when it isn’t.

Do you still have the core file around? If so, can you print out the value of 
the “arch” variable? It would be in the opal_hwloc_base_get_topo_signature 
level.

I’m wondering if that value has been hosed, and the problem is memory 
corruption somewhere.
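
One way a strlen() on an address like 0x2 can arise in a printf-length
estimator (an assumption here, not a confirmed diagnosis) is that the scanner
fails to consume an argument for some conversion such as "%u", so a later "%s"
picks up an integer as if it were a pointer. The sketch below is not Open
MPI's guess_strlen(); the name estimate_len and the digit budgets are
hypothetical. It only shows the idea that every recognized conversion must
pull exactly one matching va_arg, and that a NULL string needs a guard.

```c
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical estimator: upper bound on the formatted length (including the
 * terminating NUL) for a format using only %d, %u, %s and %%. */
static size_t estimate_len(const char *fmt, ...)
{
    va_list ap;
    size_t len = strlen(fmt) + 1;   /* literal text plus terminating NUL */
    int stop = 0;

    va_start(ap, fmt);
    for (const char *p = fmt; !stop && *p != '\0'; ++p) {
        if (*p != '%') {
            continue;
        }
        ++p;
        if (*p == '\0') {
            break;                  /* trailing '%' with nothing after it */
        }
        switch (*p) {
        case '%':
            break;                  /* literal percent, no argument */
        case 'd':
            (void)va_arg(ap, int);
            len += 11;              /* assumes 32-bit int: "-2147483648" */
            break;
        case 'u':
            (void)va_arg(ap, unsigned int);
            len += 10;              /* assumes 32-bit unsigned: "4294967295" */
            break;
        case 's': {
            const char *s = va_arg(ap, const char *);
            len += (s != NULL) ? strlen(s) : strlen("(null)");
            break;
        }
        default:
            /* Unknown conversion: the argument type is unknown, so stop
             * rather than read the remaining arguments with a wrong type. */
            stop = 1;
            break;
        }
    }
    va_end(ap);
    return len;
}
```

With a format of the same shape as the one in the backtrace, the "%u" cases
keep the argument list in step, so the final "%s" sees the real string
pointer (or NULL, which is handled) instead of a small integer.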


> On Dec 11, 2014, at 8:56 PM, Ralph Castain  wrote:
> 
> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t 
> returning an architecture type for some reason, and I didn’t protect against 
> it.
> 
> 
>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove wrote:
>> 
>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>> I've changed the subject line to distinguish this from the earlier report.
>> 
>> -Paul
>> 
>> program terminated by signal SEGV (no mapping at the fault address)
>> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
>> Current function is guess_strlen
>>71   len += (int)strlen(sarg);
>> (dbx) where
>>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at 
>> 0x7d93b634 
>> =>[2] guess_strlen(fmt = 0x7eeada98 
>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in 
>> "printf.c"
>>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in 
>> "printf.c"
>>   [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in 
>> "printf.c"
>>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in 
>> "hwloc_base_util.c"
>>   [6] rte_init(), line 205 in "ess_hnp_module.c"
>>   [7] orte_init(pargc = 0x761c, pargv = 0x7610, 
>> flags = 4U), line 148 in "orte_init.c"
>>   [8] orterun(argc = 7, argv = 0x77a8), line 856 in "orterun.c"
>>   [9] main(argc = 7, argv = 0x77a8), line 13 in "main.c"
>> 
>> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain wrote:
>> No, that looks different - it’s failing in mpirun itself. Can you get a line 
>> number on it?
>> 
>> Sorry for delay - I’m generating rc3 now
>> 
>> 
>>> On Dec 11, 2014, at 6:59 PM, Paul Hargrove wrote:
>>> 
>>> Don't see an rc3 yet.
>>> 
>>> My Solaris-10/SPARC runs fail slightly differently (see below).
>>> It looks sufficiently similar that it MIGHT be the same root cause.
>>> However, lacking an rc3 to test I figured it would be better to report this 
>>> than to ignore it.
>>> 
>>> The problem is present with both V8+ and V9 ABIs, and with both Gnu and Sun 
>>> compilers.
>>> 
>>> -Paul
>>> 
>>> [niagara1:29881] *** Process received signal ***
>>> [niagara1:29881] Signal: Segmentation Fault (11)
>>> [niagara1:29881] Signal code: Address not mapped (1)
>>> [niagara1:29881] Failing at address: 2
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
>>> /lib/libc.so.1:0xc5364
>>> /lib/libc.so.1:0xb9e64
>>> /lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_asprintf+0x30
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_hwloc_base_get_topo_signature+0x24c
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/openmpi/mca_ess_hnp.so:0x2d90
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-rte.so.7.0.5:orte_init+0x2f8
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:orterun+0xaa8
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:main+0x14
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:_start+0x5c
>>> [niagara1:29881] *** End of error message ***
>>> Segmentation Fault - core dumped
>>> 
>>> On Thu, Dec 11, 2014 at 3:29 PM, Ralph Castain wrote:
>>> Ah crud - incomplete commit means we didn’t send the topo string. Will roll 
>>> rc3 in a few minutes.
>>> 
>>> Thanks, Paul
>>> Ralph
>>> 
>>>> On Dec 11, 2014, at 3:08 PM, Paul Hargrove wrote:
>>>> 
>>>> Testing the 1.8.4rc2 tarball on my x86-64 Solaris-11 systems I am getting 
>>>> the following crash for both "-m32" and "-m64" builds:
>>>> 
>>>> $ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 
>>>> examples/ring_c'

Re: [OMPI devel] [OMPI users] OpenMPI 1.8.4 and hwloc in Fedora 14 using a beta gcc 5.0 compiler.

2014-12-12 Thread Ralph Castain

Thanks Brice!

Our 1.8 branch probably has another 2 or so years in it, but I think we can 
lock it down fairly soon. Since we’ve shaken a lot of the bugs out of 1.8, we 
are now seeing the “adoption wave” that is causing bug reports. Once we get 
thru this, I expect things will settle down again.

I know Jeff is hosed, and I’m likewise next week. Can someone create a PR to 
update 1.8 with these patches?

> On Dec 12, 2014, at 12:32 AM, Brice Goglin  wrote:
> 
> Le 12/12/2014 07:36, Gilles Gouaillardet a écrit :
>> Brice,
>> 
>> ompi master is based on hwloc 1.9.1, isn't it ?
> 
> Yes sorry, I am often confused by all these OMPI vs hwloc branch numbers.
> 
>> 
>> if some backport is required for hwloc 1.7.2 (used by ompi v1.8), then
>> could you please update the hwloc v1.7 branch ?
> 
> Done. I pushed 14 commits there. This branch lags significantly behind master 
> and v1.10 so I don't think I'll be able to maintain it much longer.
> 
> Brice
> 

Re: [OMPI devel] [1.8.4rc3] false report of no loopback interface + segv at exit

2014-12-12 Thread Ralph Castain

Kewl - thanks to both of you for the explanation. I’ll make the adjustment.

> On Dec 11, 2014, at 9:10 PM, Paul Hargrove  wrote:
> 
> Ralph,
> 
> The "understanding" Gilles just expressed matches my own.
> 
> The issue that the OP observed on an ARM/Linux system (and I was able to 
> reproduce on Linux w/ any arch) is that when the LO interface is missing 
> Linux is unable to pass loopback messages sent on ANY interface.  The oob_tcp 
> code was trying to connect to a 172.18.0.x address when I reproduced it.
> 
> In summary:
> 
> For LINUX the lack of a loopback interface (selected or not) prevents local 
> connection.
> For NON-LINUX the lack of a loopback interface MAKES NO DIFFERENCE.
> 
> So, I think Gilles's version is correct, but that making the logic (at least 
> the reporting) conditional on Linux might be an improvement.
> 
> Since this is a warning, it might be better to remove from 1.8 until we have 
> more certainty about where/when it matters.  I don't think users will 
> appreciate a "cry wolf" release.
> 
> -Paul
> 
> On Thu, Dec 11, 2014 at 9:01 PM, Gilles Gouaillardet wrote:
> Ralph,
> 
> here is my understanding of what happens on Linux :
> 
> lo: 127.0.0.1/8 
> eth0: 192.168.122.101/24 
> 
> mpirun --mca orte_oob_tcp_if_include eth0 ...
> 
> so the mpi task tries to contact orted/mpirun on 192.168.0.1/24 
> 
> 
> that works just fine if the loopback interface is active,
> and that hangs if there is no loopback interface.
> 
> 
> imho that is a linux oddity, and OMPI has nothing to do with it
> 
> Cheers,
> 
> Gilles
> 
> [root@slurm1 ~]# ping -c 3 192.168.122.101
> PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data.
> 64 bytes from 192.168.122.101: icmp_seq=1 ttl=64 time=0.013 ms
> 64 bytes from 192.168.122.101: icmp_seq=2 ttl=64 time=0.009 ms
> 64 bytes from 192.168.122.101: icmp_seq=3 ttl=64 time=0.011 ms
> 
> --- 192.168.122.101 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 1999ms
> rtt min/avg/max/mdev = 0.009/0.011/0.013/0.001 ms
> 
> 
> 
> [root@slurm1 ~]# ifdown lo
> [root@slurm1 ~]# ping -c 3 192.168.122.101
> PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data.
> 
> --- 192.168.122.101 ping statistics ---
> 3 packets transmitted, 0 received, 100% packet loss, time 11999ms
> 
> 
> 
> On 2014/12/12 13:54, Ralph Castain wrote:
>> I honestly think it has to be a selected interface, Gilles, else we will 
>> fail to connect.
>> 
>>> On Dec 11, 2014, at 8:26 PM, Gilles Gouaillardet wrote:
>>> 
>>> Paul,
>>> 
>>> about the five warnings :
>>> can you confirm you are running mpirun *not* on n15 nor n16 ?
>>> if my guess is correct, then you can get up to 5 warnings : mpirun + 2 
>>> orted + 2 mpi tasks
>>> 
>>> do you have any oob_tcp_if_include or oob_tcp_if_exclude settings in your 
>>> openmpi-mca-params.conf ?
>>> 
>>> attached is a patch to fix this issue.
>>> what we really want is to test that there is a loopback interface, period.
>>> the current code (my bad for not having reviewed it in a timely manner)
>>> seems to check that there is a *selected* loopback interface.
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 2014/12/12 13:15, Paul Hargrove wrote:
>>>> Ralph,
>>>>
>>>> Sorry to be the bearer of more bad news.
>>>> The "good" news is I've seen the new warning regarding the lack of a
>>>> loopback interface.
>>>> The BAD news is that it is occurring on a Linux cluster that I've verified
>>>> DOES have 'lo' configured on the front-end and compute nodes (UP and
>>>> RUNNING according to ifconfig).
>>>>
>>>> Though run with "-np 2" the warning appears FIVE times.
>>>> ADDITIONALLY, there is a SEGV at exit!
>>>>
>>>> Unfortunately, despite configuring with --enable-debug, I didn't get line
>>>> numbers from the core (and there was no backtrace printed).
>>>>
>>>> All of this appears below (and no, "-mca mtl psm" is not a typo or a joke).
>>>>
>>>> Let me know what tracing flags to apply to gather the info needed to debug
>>>> this.
>>>>
>>>> -Paul
>>>>
>>>>
>>>> $ mpirun -mca btl sm,self -np 2 -host n15,n16 -mca mtl psm examples/ring_c
>>>> --
>>>> WARNING: No loopback interface was found. This can cause problems
>>>> when we spawn processes as they are likely to be unable to connect
>>>> back to their host daemon. Sadly, it may take awhile for the connect
>>>> attempt to fail, so you may experience a significant hang time.
>>>>
>>>> You may wish to ctrl-c out of your job and activate loopback support
>>>> on at least one interface before trying again.
>>>>
>>>> --
>>>> [... above message FOUR more times ...]
>>>> Process 1 exiting
>>>> Pr

Re: [OMPI devel] [OMPI users] OpenMPI 1.8.4 and hwloc in Fedora 14 using a beta gcc 5.0 compiler.

2014-12-12 Thread Brice Goglin

On 12/12/2014 07:36, Gilles Gouaillardet wrote:
> Brice,
>
> ompi master is based on hwloc 1.9.1, isn't it ?

Yes, sorry. I am often confused by all these OMPI vs. hwloc branch numbers.

>
> if some backport is required for hwloc 1.7.2 (used by ompi v1.8), then
> could you please update the hwloc v1.7 branch ?

Done. I pushed 14 commits there. This branch lags significantly behind
master and v1.10 so I don't think I'll be able to maintain it much longer.

Brice

Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in ompi/runtime/ompi_mpi_init.c is called to late

2014-12-12 Thread Pascal Deveze

George,

My initial problem is that when MPI is compiled with
“--enable-mpi-thread-multiple”, the variable enable_mpi_threads is set to 1
even if MPI_Init() is called instead of MPI_Init_thread().
I also saw that opal_using_threads() exists and is used by other BTLs.

Maybe the solution is to find a way to set enable_mpi_threads to 0 when
MPI_Init() is called.
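
One way to picture that suggestion, as a minimal sketch rather than Open MPI's actual code: the MPI standard treats MPI_Init() as a request for MPI_THREAD_SINGLE, so the flag could be derived from the level requested at run time rather than from the build option alone. Every sketch_* name below is hypothetical, and the enum only stands in for the constants from mpi.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the MPI thread levels (the real ones live
 * in mpi.h). They are listed in increasing order of thread support. */
enum sketch_thread_level {
    SKETCH_THREAD_SINGLE,
    SKETCH_THREAD_FUNNELED,
    SKETCH_THREAD_SERIALIZED,
    SKETCH_THREAD_MULTIPLE
};

/* Assumption: stands in for a build done with
 * --enable-mpi-thread-multiple. */
#define SKETCH_BUILT_WITH_THREAD_MULTIPLE 1

/* MPI_Init() is equivalent to asking for the lowest level, whatever
 * the library was built to support. */
static enum sketch_thread_level
sketch_requested_level(bool called_init_thread,
                       enum sketch_thread_level requested)
{
    assert(requested <= SKETCH_THREAD_MULTIPLE);
    return called_init_thread ? requested : SKETCH_THREAD_SINGLE;
}

/* The flag handed to the components is true only when the build
 * supports MPI_THREAD_MULTIPLE and the application asked for it. */
static bool sketch_enable_mpi_threads(enum sketch_thread_level requested)
{
    return SKETCH_BUILT_WITH_THREAD_MULTIPLE &&
           requested == SKETCH_THREAD_MULTIPLE;
}
```

With this shape, an application that calls plain MPI_Init() would see the flag as false even on a thread-multiple build.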


From: devel [mailto:devel-boun...@open-mpi.org] On behalf of George Bosilca
Sent: Friday, December 12, 2014 07:03
To: Open MPI Developers
Subject: Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in
ompi/runtime/ompi_mpi_init.c is called to late

On Thu, Dec 11, 2014 at 8:30 PM, Ralph Castain
<r...@open-mpi.org> wrote:
Just to help me understand: I don’t think this change actually changed any 
behavior. However, it certainly *allows* a different behavior. Isn’t that true?

It depends on how you look at this. To be extremely clear, the original design
prevents the modules from using anything other than their arguments to decide
the provided threading model. With the current change, it is possible that
some of the modules will continue to follow this "old" behavior, while others
might switch to checking opal_using_threads to see how they should behave.

My point here is not that one is better than the other, just that we
inadvertently introduced a possibility for inconsistent behavior.

Let me take an example. In the old scheme, the PML was allowed to run each BTL
in a separate thread, with absolutely no BTL support for thread safety. Thus,
the PML could have managed all the interactions between BTLs and requests in
an atomic way, without the BTL knowing about it. Now, if the BTL makes its
decision based on the value returned by opal_using_threads, this approach is
not possible anymore.

If so, I guess the real question is for Pascal at Bull: why do you feel this 
earlier setting is required?

This might allow one to check whether functions that require protection, such
as opal_lifo_push, will work by default or whether one should use their atomic
versions directly?

  George.



On Dec 11, 2014, at 4:21 PM, George Bosilca
<bosi...@icl.utk.edu> wrote:

The overall design in OMPI was that no OMPI module should be allowed to decide
if threads are on (thus it should not rely on the value returned by
opal_using_threads during its initialization stage). Instead, modules should
respect the level of thread support requested as an argument during the
initialization step.

And this is true even for the BTLs. The PML component init function
propagates enable_progress_threads and enable_mpi_threads down to the
BML, and then to the BTL. These two variables, enable_progress_threads and
enable_mpi_threads, are exactly what ompi_mpi_init uses to compute the
value of opal_using_threads (the computation that this patch moved).

The setting of opal_using_threads was delayed during initialization to
ensure that its value was not used to select a specific thread level in any
module, a behavior that is allowed now with the new setting.

A drastic change in behavior...

  George.


On Tue, Dec 9, 2014 at 3:33 AM, Ralph Castain
<r...@open-mpi.org> wrote:
Kewl - I’ll fix. Thanks!

On Dec 9, 2014, at 12:32 AM, Pascal Deveze
<pascal.dev...@bull.net> wrote:

Hi Ralph,

This is in the trunk.

From: devel [mailto:devel-boun...@open-mpi.org] On behalf of Ralph Castain
Sent: Tuesday, December 9, 2014 09:32
To: Open MPI Developers
Subject: Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in
ompi/runtime/ompi_mpi_init.c is called to late

Hi Pascal

Is this in the trunk or in the 1.8 series (or both)?


On Dec 9, 2014, at 12:28 AM, Pascal Deveze
<pascal.dev...@bull.net> wrote:


In the case where MPI is compiled with --enable-mpi-thread-multiple, a call to
opal_using_threads() always returns 0 in the routine btl_xxx_component_init()
of the BTLs, even if the application calls MPI_Init_thread() with
MPI_THREAD_MULTIPLE.

This is because opal_set_using_threads(true) in ompi/runtime/ompi_mpi_init.c is
called too late.

I propose the following patch that solves the problem for me:

diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
index 35509cf..c2370fc 100644
--- a/ompi/runtime/ompi_mpi_init.c
+++ b/ompi/runtime/ompi_mpi_init.c
@@ -512,6 +512,13 @@ int ompi_mpi_init(int argc, char **argv, int requested, int *provided)
     }
 #endif

+    /* If thread support was enabled, then setup OPAL to allow for
+       them. */
+    if ((OPAL_ENABLE_PROGRESS_THREADS == 1) ||
+        (*provided != MPI_THREAD_SINGLE)) {
+        opal_set_using_threads(true);
+    }
+
     /* initialize datatypes. This step should be done early as it will
      * create the local convertor and local arch used in the proc
      * init.
@@ -724,13 +731,6 @@ int ompi_mpi_init(int argc, char **argv, int requested, int *provided)
         goto error;
     }

-    /* If thread support was enabled, then setup OPAL to allow for
-

[OMPI devel] [1.8.4rc3] dangling symlinks

2014-12-12 Thread Paul Hargrove

On a Linux system configured without Java support I see the following two
dangling symlinks installed in ${prefix}/bin:

lrwxrwxrwx  1 phhargrove phhargrove 8 Dec 11 23:52 oshjavac -> mpijavac
lrwxrwxrwx  1 phhargrove phhargrove 8 Dec 11 23:52 shmemjavac -> mpijavac

It seems there is some logic missing to make installation of those links
conditional on Java support.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in ompi/runtime/ompi_mpi_init.c is called to late

2014-12-12 Thread George Bosilca

On Thu, Dec 11, 2014 at 9:41 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

>  George,
>
> please allow me to jump in with naive comments ...
>
> currently (master) both the openib and usnic btls invoke opal_using_threads
> in component_init() :
>
> btl_openib_component_init(int *num_btl_modules,
>                           bool enable_progress_threads,
>                           bool enable_mpi_threads)
> {
>     [...]
>     /* Currently refuse to run if MPI_THREAD_MULTIPLE is enabled */
>     if (opal_using_threads() && !mca_btl_base_thread_multiple_override) {
>         opal_output_verbose(5, opal_btl_base_framework.framework_output,
>                             "btl:openib: MPI_THREAD_MULTIPLE not suppported; skipping this component");
>         goto no_btls;
>     }
>

The code should not check opal_using_threads(); instead, it should check
the local argument enable_mpi_threads.
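
A minimal sketch of that shape, not the openib code itself: the component decides from the argument it is handed, so the result does not depend on when a global flag happens to be set. The sketch_* names are hypothetical, and the override variable only mimics the role of mca_btl_base_thread_multiple_override in the snippet quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical override knob, playing the role of
 * mca_btl_base_thread_multiple_override in the quoted code. */
static bool sketch_thread_multiple_override = false;

/* Hypothetical component init. Returns true when the component can be
 * used and stores the number of modules it offers in *num_modules. */
static bool sketch_component_init(int *num_modules,
                                  bool enable_progress_threads,
                                  bool enable_mpi_threads)
{
    assert(num_modules != NULL);
    (void)enable_progress_threads;   /* not needed in this sketch */

    *num_modules = 0;

    /* Decide from the argument passed down by the caller, not from a
     * global such as opal_using_threads(), whose value depends on how
     * far initialization has progressed. */
    if (enable_mpi_threads && !sketch_thread_multiple_override) {
        return false;   /* not thread safe: skip this component */
    }

    *num_modules = 1;
    return true;
}
```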


> The overall design in OMPI was that no OMPI module should be allowed to 
> decide if threads are on
>
>
> does "OMPI module" exclude OPAL and ORTE module ?
>

Every module where the threading level is provided in the _init function. I
am not aware of any in ORTE, but there are certainly quite a few
in OPAL and OMPI.


> if yes, did the btl move from OMPI down to OPAL have any impact ?
>

It should not; the highest level defines the supported threading level (OMPI
in the current code).


>
> if not, then could/should opal_using_threads() abort and/or display an
> error message if it is called too early
> (at least in debug builds) ?
>

I don't think so. The point I wanted to draw attention to is that relying
solely on opal_using_threads in a module initialization to decide the level of
thread support provided is wrong. This information can be used to augment
the decision already taken using the two provided arguments, but nothing
more (certainly not as shown in the above code).

That being said, I can also see how one might use opal_using_threads()
to extend upon the provided level of thread safety. Let me take an example.
A BTL can provide internal async progress for all pending fragments as long
as it does not involve any other component (PML, mpool and so on). Thus,
even if the BTL is asked for only minimal threading support during
initialization, it might check opal_using_threads to see whether it can
safely use functions such as opal_lifo_push or whether it should use
opal_lifo_atomic_push instead. I know this example is somewhat convoluted,
as one can always go for the version known to be atomic, but it is the best I
found.
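
The choice described in that example could be sketched as follows, assuming a C11 compiler. This is an illustration only: the sketch_* types and functions are hypothetical and are not the OPAL LIFO implementation; the atomic variant is a plain compare-and-swap retry loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct sketch_item {
    struct sketch_item *next;
    int value;
} sketch_item_t;

typedef struct {
    _Atomic(sketch_item_t *) head;
} sketch_lifo_t;

/* Plain push: only safe while a single thread touches the list. */
static void sketch_lifo_push(sketch_lifo_t *lifo, sketch_item_t *item)
{
    item->next = atomic_load_explicit(&lifo->head, memory_order_relaxed);
    atomic_store_explicit(&lifo->head, item, memory_order_relaxed);
}

/* Atomic push: retry the compare-and-swap until no other thread has
 * changed the head between the read and the update. */
static void sketch_lifo_push_atomic(sketch_lifo_t *lifo, sketch_item_t *item)
{
    sketch_item_t *old = atomic_load(&lifo->head);
    do {
        item->next = old;   /* 'old' is refreshed on each failed attempt */
    } while (!atomic_compare_exchange_weak(&lifo->head, &old, item));
}

/* Pick the variant from a flag, as a BTL might do after consulting
 * something like opal_using_threads(). */
static void sketch_push(sketch_lifo_t *lifo, sketch_item_t *item,
                        bool using_threads)
{
    assert(lifo != NULL && item != NULL);
    if (using_threads) {
        sketch_lifo_push_atomic(lifo, item);
    } else {
        sketch_lifo_push(lifo, item);
    }
}
```

As noted in the paragraph above, always taking the atomic path is also correct; the flag only saves the cost of the compare-and-swap when a single thread is running.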

  George.



>
> Cheers,
>
> Gilles
>
> On 2014/12/12 10:30, Ralph Castain wrote:
>
> Just to help me understand: I don’t think this change actually changed any 
> behavior. However, it certainly *allows* a different behavior. Isn’t that 
> true?
>
> If so, I guess the real question is for Pascal at Bull: why do you feel this 
> earlier setting is required?
>
>
>
> On Dec 11, 2014, at 4:21 PM, George Bosilca wrote:
>
> The overall design in OMPI was that no OMPI module should be allowed to
> decide if threads are on (thus it should not rely on the value returned by
> opal_using_threads during its initialization stage). Instead, modules should
> respect the level of thread support requested as an argument during the
> initialization step.
>
> And this is true even for the BTLs. The PML component init function
> propagates enable_progress_threads and enable_mpi_threads down to the
> BML, and then to the BTL. These two variables, enable_progress_threads and
> enable_mpi_threads, are exactly what ompi_mpi_init uses to compute
> the value of opal_using_threads (the computation that this patch moved).
>
> The setting of opal_using_threads was delayed during initialization
> to ensure that its value was not used to select a specific thread level in
> any module, a behavior that is allowed now with the new setting.
>
> A drastic change in behavior...
>
>   George.
>
>
> On Tue, Dec 9, 2014 at 3:33 AM, Ralph Castain wrote:
> Kewl - I’ll fix. Thanks!
>
>
> On Dec 9, 2014, at 12:32 AM, Pascal Deveze wrote:
>
> Hi Ralph,
>
> This is in the trunk.
>
> From: devel [mailto:devel-boun...@open-mpi.org] On behalf of Ralph Castain
> Sent: Tuesday, December 9, 2014 09:32
> To: Open MPI Developers
> Subject: Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in
> ompi/runtime/ompi_mpi_init.c is called to late
>
> Hi Pascal
>
> Is this in the trunk or in the 1.8 series (or both)?
>
>
> On Dec 9, 2014, at 12:28 AM, Pascal Deveze wrote:
>
>
> In the case where MPI is compiled with --enable-mpi-thread-multiple, a call
> to opal_using_threads() always returns 0 in the routine
> btl_xxx_component_init() of the BTLs, even if the application calls
> MPI_Init_thread() with MPI_THREAD_MULTIPLE.
>
> This is because opal_set_us

Re: [OMPI devel] [OMPI users] OpenMPI 1.8.4 and hwloc in Fedora 14 using a beta gcc 5.0 compiler.

2014-12-12 Thread Gilles Gouaillardet

Brice,

ompi master is based on hwloc 1.9.1, isn't it ?

if some backport is required for hwloc 1.7.2 (used by ompi v1.8), then
could you please update the hwloc v1.7 branch ?

Cheers,

Gilles

On 2014/12/12 15:16, Brice Goglin wrote:
> Yes.
>
> In theory, everything that's in hwloc/v1.8 should go to OMPI/master.
>
> And most of it should go to v1.8 too, but that may require some
> backporting rework. I can update hwloc/v1.7 if that helps.
>
> Brice
>
>
>
> On 12/12/2014 03:10, Gilles Gouaillardet wrote:
>> Brice,
>>
>> should this fix be backported to both master and v1.8 ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/12 7:46, Brice Goglin wrote:
>>> This problem was fixed in hwloc upstream recently.
>>>
>>> https://github.com/open-mpi/hwloc/commit/790aa2e1e62be6b4f37622959de9ce3766ebc57e
>>> Brice
>>>
>>>
>>> On 11/12/2014 23:40, Jorge D'Elia wrote:
 Dear Jeff,

 Our updates of OpenMPI to 1.8.3 (and 1.8.4) were 
 all OK using Fedora >= 17 and system gcc compilers
 on ia32 or ia64 machines. 

 However, the "make all" step failed using Fedora 14
 with a beta gcc 5.0 compiler on an ia32 machine
 with a message like:

 Error: symbol `Lhwloc1' is already defined

 A roundabout way to solve it was, first, to perform
 a separate installation of the hwloc package (we use
 Release v1.10.0 (stable)) and, second, to configure
 OpenMPI using its flag:

   --with-hwloc=${HWLOC_HOME}

 although, in this way, the include and library path 
 must be given, e.g.

  export CFLAGS="-I/usr/beta/hwloc/include" ; echo ${CFLAGS}
  export LDFLAGS="-L/usr/beta/hwloc/lib"; echo ${LDFLAGS}
  export LIBS="-lhwloc" ; echo ${LIBS}

 In order to verify that the hwloc works OK, it would be useful 
 to include in the OpenMPI distribution a simple test like

 $ gcc ${CFLAGS} ${LDFLAGS} -o hwloc-hello.exe hwloc-hello.c ${LIBS}
 $ ./hwloc-hello.exe

 (we apologize for forgetting to use the --with-hwloc-libdir flag ...).

 With this previous step we could overcome the fatal error 
 in the configuration step related to the hwloc package.

 This (fixed) trouble in the configuration step is the same
 as the one reported as:

 Open MPI 1.8.1: "make all" error: symbol `Lhwloc1' is already defined

 on 2014-08-12 15:08:38


 Regards,
 Jorge.

 - Original message -
> From: "Jorge D'Elia"
> To: "Open MPI Users"
> Sent: Tuesday, August 12, 2014 16:08:38
> Subject: Re: [OMPI users] Open MPI 1.8.1: "make all" error: symbol
> `Lhwloc1' is already defined
>
> Dear Jeff,
>
> These new versions of the tgz files replace the previous ones:
> I had used an old outdated session environment. However, the
> configuration and installation was OK again in each case.
> Sorry for the noise caused by the previous tgz files.
>
> Regards,
> Jorge.
>
> - Original message -
>> From: "Jorge D'Elia"
>> To: "Open MPI Users"
>> Sent: Tuesday, August 12, 2014 15:16:19
>> Subject: Re: [OMPI users] Open MPI 1.8.1: "make all" error: symbol
>> `Lhwloc1' is already defined
>>
>> Dear Jeff,
>>
>> - Original message -
>>> From: "Jeff Squyres (jsquyres)"
>>> To: "Open MPI User's List"
>>> Sent: Monday, August 11, 2014 11:47:29
>>> Subject: Re: [OMPI users] Open MPI 1.8.1: "make all" error: symbol
>>> `Lhwloc1' is already defined
>>>
>>> The problem appears to be occurring in the hwloc component in OMPI.
>>> Can you download hwloc 1.7.2 (standalone) and try to build that on
>>> the target machine and see what happens?
>>>
>>> http://www.open-mpi.org/software/hwloc/v1.7/
>> OK. Just in case I tried both version 1.7.2 and 1.9 (stable).
>> Both gave no errors in the configuration or installation.
>> Attached a *.tgz file for each case. Greetings. Jorge.
>>
>>  
>>> On Aug 10, 2014, at 11:16 AM, Jorge D'Elia 
>>> wrote:
>>>
 Hi,

 I tried to re-compile Open MPI 1.8.1 version for Linux
 on an ia32 machine with Fedora 14 although using the
 last version of Gfortran (Gfortran 4.10 is required
 by a user program which runs ok).

 However, the "make all" phase breaks with the
 error message:

  Error: symbol `Lhwloc1' is already defined

 I attached a tgz file (tar -zcvf) with:

  Output "configure.txt" from "./configure" Open MPI phase;
  The "config.log" file from the top-level Open MPI directory;
  Output "make.txt"from "make all" to build Open MPI;
  Output "make-v1.txt" from "make V=1" to build Open MPI;
  Outputs from cat /proc/version and cat /proc/cpuinfo

 Please,

Re: [OMPI devel] [OMPI users] OpenMPI 1.8.4 and hwloc in Fedora 14 using a beta gcc 5.0 compiler.

2014-12-12 Thread Brice Goglin

Yes.

In theory, everything that's in hwloc/v1.8 should go to OMPI/master.

And most of it should go to v1.8 too, but that may require some
backporting rework. I can update hwloc/v1.7 if that helps.

Brice



On 12/12/2014 03:10, Gilles Gouaillardet wrote:
> Brice,
>
> should this fix be backported to both master and v1.8 ?
>
> Cheers,
>
> Gilles
>
> On 2014/12/12 7:46, Brice Goglin wrote:
>> This problem was fixed in hwloc upstream recently.
>>
>> https://github.com/open-mpi/hwloc/commit/790aa2e1e62be6b4f37622959de9ce3766ebc57e
>> Brice
>>
>>
>> On 11/12/2014 23:40, Jorge D'Elia wrote:
>>> Dear Jeff,
>>>
>>> Our updates of OpenMPI to 1.8.3 (and 1.8.4) were 
>>> all OK using Fedora >= 17 and system gcc compilers
>>> on ia32 or ia64 machines. 
>>>
>>> However, the "make all" step failed using Fedora 14
>>> with a beta gcc 5.0 compiler on an ia32 machine
>>> with a message like:
>>>
>>> Error: symbol `Lhwloc1' is already defined
>>>
>>> A roundabout way to solve it was, first, to perform
>>> a separate installation of the hwloc package (we use
>>> Release v1.10.0 (stable)) and, second, to configure
>>> OpenMPI using its flag:
>>>
>>>   --with-hwloc=${HWLOC_HOME}
>>>
>>> although, in this way, the include and library path 
>>> must be given, e.g.
>>>
>>>  export CFLAGS="-I/usr/beta/hwloc/include" ; echo ${CFLAGS}
>>>  export LDFLAGS="-L/usr/beta/hwloc/lib"; echo ${LDFLAGS}
>>>  export LIBS="-lhwloc" ; echo ${LIBS}
>>>
>>> In order to verify that the hwloc works OK, it would be useful 
>>> to include in the OpenMPI distribution a simple test like
>>>
>>> $ gcc ${CFLAGS} ${LDFLAGS} -o hwloc-hello.exe hwloc-hello.c ${LIBS}
>>> $ ./hwloc-hello.exe
>>>
>>> (we apologize for forgetting to use the --with-hwloc-libdir flag ...).
>>>
>>> With this previous step we could overcome the fatal error 
>>> in the configuration step related to the hwloc package.
>>>
>>> This (fixed) trouble in the configuration step is the same
>>> as the one reported as:
>>>
>>> Open MPI 1.8.1: "make all" error: symbol `Lhwloc1' is already defined
>>>
>>> on 2014-08-12 15:08:38
>>>
>>>
>>> Regards,
>>> Jorge.
>>>
>>> - Original message -
 From: "Jorge D'Elia"
 To: "Open MPI Users"
 Sent: Tuesday, August 12, 2014 16:08:38
 Subject: Re: [OMPI users] Open MPI 1.8.1: "make all" error: symbol
 `Lhwloc1' is already defined

 Dear Jeff,

 These new versions of the tgz files replace the previous ones:
 I had used an old outdated session environment. However, the
 configuration and installation was OK again in each case.
 Sorry for the noise caused by the previous tgz files.

 Regards,
 Jorge.

 - Original message -
> From: "Jorge D'Elia"
> To: "Open MPI Users"
> Sent: Tuesday, August 12, 2014 15:16:19
> Subject: Re: [OMPI users] Open MPI 1.8.1: "make all" error: symbol
> `Lhwloc1' is already defined
>
> Dear Jeff,
>
> - Original message -
>> From: "Jeff Squyres (jsquyres)"
>> To: "Open MPI User's List"
>> Sent: Monday, August 11, 2014 11:47:29
>> Subject: Re: [OMPI users] Open MPI 1.8.1: "make all" error: symbol
>> `Lhwloc1' is already defined
>>
>> The problem appears to be occurring in the hwloc component in OMPI.
>> Can you download hwloc 1.7.2 (standalone) and try to build that on
>> the target machine and see what happens?
>>
>> http://www.open-mpi.org/software/hwloc/v1.7/
> OK. Just in case I tried both version 1.7.2 and 1.9 (stable).
> Both gave no errors in the configuration or installation.
> Attached a *.tgz file for each case. Greetings. Jorge.
>
>  
>> On Aug 10, 2014, at 11:16 AM, Jorge D'Elia 
>> wrote:
>>
>>> Hi,
>>>
>>> I tried to re-compile Open MPI 1.8.1 version for Linux
>>> on an ia32 machine with Fedora 14 although using the
>>> last version of Gfortran (Gfortran 4.10 is required
>>> by a user program which runs ok).
>>>
>>> However, the "make all" phase breaks with the
>>> error message:
>>>
>>>  Error: symbol `Lhwloc1' is already defined
>>>
>>> I attached a tgz file (tar -zcvf) with:
>>>
>>>  Output "configure.txt" from "./configure" Open MPI phase;
>>>  The "config.log" file from the top-level Open MPI directory;
>>>  Output "make.txt"from "make all" to build Open MPI;
>>>  Output "make-v1.txt" from "make V=1" to build Open MPI;
>>>  Outputs from cat /proc/version and cat /proc/cpuinfo
>>>
>>> Please, any clue in order to fix?
>>>
>>> Regards in advance.
>>> Jorge.
>>>
>>> --
>>> CIMEC (UNL-CONICET) Predio CONICET-Santa Fe, Colectora Ruta Nac 168,
>>> Paraje El Pozo, S3000GLN Santa Fe, ARGENTINA, http://www.cimec.org.ar/
>>> Tel +54 342 451.15.94/95 ext 1018, fax: +54-342-451.11.69
>>> ___
>>> use

Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in ompi/runtime/ompi_mpi_init.c is called to late

2014-12-12 Thread George Bosilca

On Thu, Dec 11, 2014 at 8:30 PM, Ralph Castain  wrote:

> Just to help me understand: I don’t think this change actually changed any
> behavior. However, it certainly *allows* a different behavior. Isn’t that
> true?
>

It depends on how you look at this. To be extremely clear, the original design
prevents the modules from using anything other than their arguments to decide
the provided threading model. With the current change, it is possible that
some of the modules will continue to follow this "old" behavior, while others
might switch to checking opal_using_threads to see how they should behave.

My point here is not that one is better than the other, just that we
inadvertently introduced a possibility for inconsistent behavior.

Let me take an example. In the old scheme, the PML was allowed to run each
BTL in a separate thread, with absolutely no BTL support for thread safety.
Thus, the PML could have managed all the interactions between BTLs and
requests in an atomic way, without the BTL knowing about it. Now, if the BTL
makes its decision based on the value returned by opal_using_threads, this
approach is not possible anymore.


> If so, I guess the real question is for Pascal at Bull: why do you feel
> this earlier setting is required?
>

This might allow one to check whether functions that require protection, such
as opal_lifo_push, will work by default or whether one should use their
atomic versions directly?

  George.


>
>
> On Dec 11, 2014, at 4:21 PM, George Bosilca  wrote:
>
> The overall design in OMPI was that no OMPI module should be allowed to
> decide if threads are on (thus it should not rely on the value returned by
> opal_using_threads during its initialization stage). Instead, modules
> should respect the level of thread support requested as an argument during
> the initialization step.
>
> And this is true even for the BTLs. The PML component init function
> propagates enable_progress_threads and enable_mpi_threads down to
> the BML, and then to the BTL. These two variables, enable_progress_threads
> and enable_mpi_threads, are exactly what ompi_mpi_init uses to compute
> the value of opal_using_threads (the computation that this patch moved).
>
> The setting of opal_using_threads was delayed during
> initialization to ensure that its value was not used to select a specific
> thread level in any module, a behavior that is allowed now with the new
> setting.
>
> A drastic change in behavior...
>
>   George.
>
>
> On Tue, Dec 9, 2014 at 3:33 AM, Ralph Castain  wrote:
>
>> Kewl - I’ll fix. Thanks!
>>
>> On Dec 9, 2014, at 12:32 AM, Pascal Deveze 
>> wrote:
>>
>> Hi Ralph,
>>
>> This is in the trunk.
>>
>> *From:* devel [mailto:devel-boun...@open-mpi.org] *On behalf of* Ralph Castain
>> *Sent:* Tuesday, December 9, 2014 09:32
>> *To:* Open MPI Developers
>> *Subject:* Re: [OMPI devel] Patch proposed: opal_set_using_threads(true)
>> in ompi/runtime/ompi_mpi_init.c is called to late
>>
>> Hi Pascal
>>
>> Is this in the trunk or in the 1.8 series (or both)?
>>
>>
>>
>> On Dec 9, 2014, at 12:28 AM, Pascal Deveze 
>> wrote:
>>
>>
>> In the case where MPI is compiled with --enable-mpi-thread-multiple, a call
>> to opal_using_threads() always returns 0 in the routine
>> btl_xxx_component_init() of the BTLs, even if the application calls
>> MPI_Init_thread() with MPI_THREAD_MULTIPLE.
>>
>> This is because opal_set_using_threads(true) in
>> ompi/runtime/ompi_mpi_init.c is called too late.
>>
>> I propose the following patch that solves the problem for me:
>>
>> diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
>> index 35509cf..c2370fc 100644
>> --- a/ompi/runtime/ompi_mpi_init.c
>> +++ b/ompi/runtime/ompi_mpi_init.c
>> @@ -512,6 +512,13 @@ int ompi_mpi_init(int argc, char **argv, int requested, int *provided)
>>      }
>>  #endif
>>
>> +    /* If thread support was enabled, then setup OPAL to allow for
>> +       them. */
>> +    if ((OPAL_ENABLE_PROGRESS_THREADS == 1) ||
>> +        (*provided != MPI_THREAD_SINGLE)) {
>> +        opal_set_using_threads(true);
>> +    }
>> +
>>      /* initialize datatypes. This step should be done early as it will
>>       * create the local convertor and local arch used in the proc
>>       * init.
>> @@ -724,13 +731,6 @@ int ompi_mpi_init(int argc, char **argv, int requested, int *provided)
>>          goto error;
>>      }
>>
>> -    /* If thread support was enabled, then setup OPAL to allow for
>> -       them. */
>> -    if ((OPAL_ENABLE_PROGRESS_THREADS == 1) ||
>> -        (*provided != MPI_THREAD_SINGLE)) {
>> -        opal_set_using_threads(true);
>> -    }
>> -
>>      /* start PML/BTL's */
>>      ret = MCA_PML_CALL(enable(true));
>>      if( OMPI_SUCCESS != ret ) {
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16459.php
>>
>>
>> ___

Re: [OMPI devel] [1.8.4rc3] false report of no loopback interface + segv at exit

2014-12-12 Thread Paul Hargrove

Ralph,

The "understanding" Gilles just expressed matches my own.

The issue that the OP observed on an ARM/Linux system (and I was able to
reproduce on Linux w/ any arch) is that when the LO interface is missing
Linux is unable to pass loopback messages sent on ANY interface.  The
oob_tcp code was trying to connect to a 172.18.0.x address when I
reproduced it.

In summary:

For LINUX the lack of a loopback interface (selected or not) prevents local
connection.
For NON-LINUX the lack of a loopback interface MAKES NO DIFFERENCE.

So, I think Gilles's version is correct, but that making the logic (at
least the reporting) conditional on Linux might be an improvement.
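
As a rough illustration of that suggestion (not the patch under discussion): look for any loopback interface that is up, whether or not it was selected, and only report its absence on Linux. The sketch assumes a system that provides getifaddrs(); the sketch_* names are hypothetical.

```c
#define _DEFAULT_SOURCE   /* for IFF_* flags on strict glibc builds */
#include <assert.h>
#include <ifaddrs.h>
#include <net/if.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns 1 if some interface is both up and a loopback, 0 if none is,
 * and -1 if the interface list could not be read. */
static int sketch_have_loopback(void)
{
    struct ifaddrs *list = NULL;
    int found = 0;

    if (getifaddrs(&list) != 0) {
        return -1;
    }
    for (struct ifaddrs *ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
        if ((ifa->ifa_flags & IFF_LOOPBACK) && (ifa->ifa_flags & IFF_UP)) {
            found = 1;
            break;
        }
    }
    freeifaddrs(list);
    return found;
}

/* Compile-time platform check. */
static bool sketch_on_linux(void)
{
#ifdef __linux__
    return true;
#else
    return false;
#endif
}

/* Warn only where the missing loopback is known to matter (Linux), and
 * only when the probe positively found none. */
static bool sketch_should_warn(bool on_linux, int have_loopback)
{
    assert(have_loopback >= -1 && have_loopback <= 1);
    return on_linux && have_loopback == 0;
}
```

A caller would combine the pieces as sketch_should_warn(sketch_on_linux(), sketch_have_loopback()). Treating a failed probe (-1) as "do not warn" is a choice made here to avoid false alarms.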

Since this is a warning, it might be better to remove it from 1.8 until we
have more certainty about where/when it matters.  I don't think users will
appreciate a "cry wolf" release.

-Paul

On Thu, Dec 11, 2014 at 9:01 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

>  Ralph,
>
> here is my understanding of what happens on Linux :
>
> lo: 127.0.0.1/8
> eth0: 192.168.122.101/24
>
> mpirun --mca orte_oob_tcp_if_include eth0 ...
>
> so the mpi task tries to contact orted/mpirun on 192.168.122.101/24
>
> that works just fine if the loopback interface is active,
> and that hangs if there is no loopback interface.
>
>
> imho that is a linux oddity, and OMPI has nothing to do with it
>
> Cheers,
>
> Gilles
>
> [root@slurm1 ~]# ping -c 3 192.168.122.101
> PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data.
> 64 bytes from 192.168.122.101: icmp_seq=1 ttl=64 time=0.013 ms
> 64 bytes from 192.168.122.101: icmp_seq=2 ttl=64 time=0.009 ms
> 64 bytes from 192.168.122.101: icmp_seq=3 ttl=64 time=0.011 ms
>
> --- 192.168.122.101 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 1999ms
> rtt min/avg/max/mdev = 0.009/0.011/0.013/0.001 ms
>
>
>
> [root@slurm1 ~]# ifdown lo
> [root@slurm1 ~]# ping -c 3 192.168.122.101
> PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data.
>
> --- 192.168.122.101 ping statistics ---
> 3 packets transmitted, 0 received, 100% packet loss, time 11999ms
>
>
>
> On 2014/12/12 13:54, Ralph Castain wrote:
>
> I honestly think it has to be a selected interface, Gilles, else we will fail 
> to connect.
>
>
> On Dec 11, 2014, at 8:26 PM, Gilles Gouaillardet wrote:
>
> Paul,
>
> about the five warnings :
> can you confirm you are running mpirun *not* on n15 nor n16 ?
> if my guess is correct, then you can get up to 5 warnings : mpirun + 2 orted 
> + 2 mpi tasks
>
> do you have any oob_tcp_if_include or oob_tcp_if_exclude settings in your 
> openmpi-mca-params.conf ?
>
> attached is a patch to fix this issue.
> what we really want is to test that there is a loopback interface, period.
> the current code (my bad for not having reviewed it in a timely manner) seems
> to check that there is a *selected* loopback interface.
>
> Cheers,
>
> Gilles
>
> On 2014/12/12 13:15, Paul Hargrove wrote:
>
>  Ralph,
>
> Sorry to be the bearer of more bad news.
> The "good" news is I've seen the new warning regarding the lack of a
> loopback interface.
> The BAD news is that it is occurring on a Linux cluster that I've verified
> DOES have 'lo' configured on the front-end and compute nodes (UP and
> RUNNING according to ifconfig).
>
> Though run with "-np 2" the warning appears FIVE times.
> ADDITIONALLY, there is a SEGV at exit!
>
> Unfortunately, despite configuring with --enable-debug, I didn't get line
> numbers from the core (and there was no backtrace printed).
>
> All of this appears below (and no, "-mca mtl psm" is not a typo or a joke).
>
> Let me know what tracing flags to apply to gather the info needed to debug
> this.
>
> -Paul
>
>
> $ mpirun -mca btl sm,self -np 2 -host n15,n16 -mca mtl psm examples/ring_c
> --
> WARNING: No loopback interface was found. This can cause problems
> when we spawn processes as they are likely to be unable to connect
> back to their host daemon. Sadly, it may take awhile for the connect
> attempt to fail, so you may experience a significant hang time.
>
> You may wish to ctrl-c out of your job and activate loopback support
> on at least one interface before trying again.
>
> --
> [... above message FOUR more times ...]
> Process 1 exiting
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> --
> mpirun noticed that process rank 0 with PID 0 on node n15 exited on signal
> 11 (Segmen

Re: [OMPI devel] [1.8.4rc3] false report of no loopback interface + segv at exit

2014-12-12 Thread Paul Hargrove

Gilles,

You are correct that mpirun is executed on a node other than n15 or n16.
So, your count to 5 makes sense.
It does seem a bit excessive, but it should only occur when there is a
problem.
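
For reference, the count of five can be reproduced with a small sketch. It assumes that each process (mpirun, each orted, each MPI task) prints the warning once, and that mpirun acts as the daemon on its own node so no separate orted is started there. The function name and the host name "frontend" are hypothetical, not taken from Open MPI.

```python
def expected_warning_count(mpirun_host, job_hosts, num_tasks):
    """Count the processes that would each print the warning once."""
    # Assumption: one orted per distinct job host, except the host where
    # mpirun itself runs (mpirun plays the daemon role there).
    daemons = len({host for host in job_hosts if host != mpirun_host})
    # 1 for mpirun itself, plus the daemons, plus the MPI tasks.
    return 1 + daemons + num_tasks


# mpirun on a separate front-end node, tasks on n15 and n16, -np 2
print(expected_warning_count("frontend", ["n15", "n16"], 2))  # 5
```

Under the same assumptions, running mpirun on n15 itself would give four warnings rather than five.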

I have no MCA params file nor any MCA-related environment variables.
So, there are no oob_tcp_if_{include,exclude} settings in force.

The patch makes sense to me and appears to fix the problem.
I'll address Ralph's concern about selected-vs-unselected interface
separately.
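
To make the selected-vs-unselected distinction concrete, here is a minimal sketch of the two checks. It is not the Open MPI code and not Gilles' patch (neither appears in this thread). The `Interface` type, the function names and the `selected` set are hypothetical; the flag values are the Linux ones from <net/if.h>.

```python
from dataclasses import dataclass

IFF_UP = 0x1        # Linux <net/if.h>: interface is up
IFF_LOOPBACK = 0x8  # Linux <net/if.h>: interface is a loopback


@dataclass(frozen=True)
class Interface:
    name: str
    flags: int


def is_active_loopback(iface):
    wanted = IFF_UP | IFF_LOOPBACK
    return (iface.flags & wanted) == wanted


def any_loopback(interfaces):
    """True if the node has an active loopback interface, period."""
    return any(is_active_loopback(i) for i in interfaces)


def selected_loopback(interfaces, selected):
    """True only if an active loopback is among the selected interfaces."""
    return any(is_active_loopback(i) for i in interfaces if i.name in selected)


node = [Interface("lo", IFF_UP | IFF_LOOPBACK), Interface("eth0", IFF_UP)]
print(any_loopback(node))                 # True
print(selected_loopback(node, {"eth0"}))  # False
```

With only eth0 selected, the second check reports no loopback even though 'lo' is up, which is the kind of false report described above.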

I still get the SEGV at exit, but that could very well be bit-rot in
mtl/shm.
I will investigate more if/when I have time.

-Paul

On Thu, Dec 11, 2014 at 8:26 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

>  Paul,
>
> about the five warnings :
> can you confirm you are running mpirun *not* on n15 nor n16 ?
> if my guess is correct, then you can get up to 5 warnings : mpirun + 2
> orted + 2 mpi tasks
>
> do you have any oob_tcp_if_include or oob_tcp_if_exclude settings in your
> openmpi-mca-params.conf ?
>
> here is attached a patch to fix this issue.
> what we really want is to test that there is a loopback interface, period.
> the current code (my bad for not having reviewed in a timely manner) seems
> to check that there is a *selected* loopback interface.
>
> Cheers,
>
> Gilles
>
>
> On 2014/12/12 13:15, Paul Hargrove wrote:
>
> Ralph,
>
> Sorry to be the bearer of more bad news.
> The "good" news is I've seen the new warning regarding the lack of a
> loopback interface.
> The BAD news is that it is occurring on a Linux cluster that I've verified
> DOES have 'lo' configured on the front-end and compute nodes (UP and
> RUNNING according to ifconfig).
>
> Though run with "-np 2" the warning appears FIVE times.
> ADDITIONALLY, there is a SEGV at exit!
>
> Unfortunately, despite configuring with --enable-debug, I didn't get line
> numbers from the core (and there was no backtrace printed).
>
> All of this appears below (and no, "-mca mtl psm" is not a typo or a joke).
>
> Let me know what tracing flags to apply to gather the info needed to debug
> this.
>
> -Paul
>
>
> $ mpirun -mca btl sm,self -np 2 -host n15,n16 -mca mtl psm examples/ring_c
> --
> WARNING: No loopback interface was found. This can cause problems
> when we spawn processes as they are likely to be unable to connect
> back to their host daemon. Sadly, it may take awhile for the connect
> attempt to fail, so you may experience a significant hang time.
>
> You may wish to ctrl-c out of your job and activate loopback support
> on at least one interface before trying again.
>
> --
> [... above message FOUR more times ...]
> Process 1 exiting
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> --
> mpirun noticed that process rank 0 with PID 0 on node n15 exited on signal
> 11 (Segmentation fault).
> --
>
> $ /sbin/ifconfig lo
> lo  Link encap:Local Loopback
>   inet addr:127.0.0.1  Mask:255.0.0.0
>   inet6 addr: ::1/128 Scope:Host
>   UP LOOPBACK RUNNING  MTU:16436  Metric:1
>   RX packets:481228 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:481228 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:0
>   RX bytes:81039065 (77.2 MiB)  TX bytes:81039065 (77.2 MiB)
>
> $ ssh n15 /sbin/ifconfig lo
> lo  Link encap:Local Loopback
>   inet addr:127.0.0.1  Mask:255.0.0.0
>   inet6 addr: ::1/128 Scope:Host
>   UP LOOPBACK RUNNING  MTU:16436  Metric:1
>   RX packets:24885 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:24885 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:0
>   RX bytes:1509940 (1.4 MiB)  TX bytes:1509940 (1.4 MiB)
>
> $ ssh n16 /sbin/ifconfig lo
> lo  Link encap:Local Loopback
>   inet addr:127.0.0.1  Mask:255.0.0.0
>   inet6 addr: ::1/128 Scope:Host
>   UP LOOPBACK RUNNING  MTU:16436  Metric:1
>   RX packets:24938 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:24938 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:0
>   RX bytes:1543408 (1.4 MiB)  TX bytes:1543408 (1.4 MiB)
>
> $ gdb examples/ring_c core.29728
> [...]
> (gdb) where
> #0  0x002a97a19980 in ?? ()
> #1  <signal handler called>
> #2  0x003a6d40607c in _Unwind_FindEnclosingFunction () from
> /lib64/libgcc_s

Re: [OMPI devel] [1.8.4rc3] false report of no loopback interface + segv at exit

2014-12-12 Thread Gilles Gouaillardet

Ralph,

here is my understanding of what happens on Linux :

lo: 127.0.0.1/8
eth0: 192.168.122.101/24

mpirun --mca orte_oob_tcp_if_include eth0 ...

so the mpi task tries to contact orted/mpirun on eth0's address, 192.168.122.101

that works just fine if the loopback interface is active,
and that hangs if there is no loopback interface.


imho that is a Linux oddity, and OMPI has nothing to do with it

Cheers,

Gilles

[root@slurm1 ~]# ping -c 3 192.168.122.101
PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data.
64 bytes from 192.168.122.101: icmp_seq=1 ttl=64 time=0.013 ms
64 bytes from 192.168.122.101: icmp_seq=2 ttl=64 time=0.009 ms
64 bytes from 192.168.122.101: icmp_seq=3 ttl=64 time=0.011 ms

--- 192.168.122.101 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.009/0.011/0.013/0.001 ms



[root@slurm1 ~]# ifdown lo
[root@slurm1 ~]# ping -c 3 192.168.122.101
PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data.

--- 192.168.122.101 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 11999ms
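
The same experiment can be approximated at the TCP level, which is closer to what an MPI task does when it connects back to its daemon. This is only a sketch using the standard socket module: it binds a listener on a local address and tries to connect to it with a timeout. The function name `can_self_connect` is hypothetical.

```python
import socket


def can_self_connect(address, timeout=2.0):
    """Return True if a TCP connect to a listener bound on `address` succeeds."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
            listener.bind((address, 0))  # port 0: let the kernel pick a free port
            listener.listen(1)
            port = listener.getsockname()[1]
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
                client.settimeout(timeout)
                client.connect((address, port))
            return True
    except OSError:
        # Covers bind failures (address not local) and connect timeouts.
        return False


print(can_self_connect("127.0.0.1"))
```

On the host above the argument would be 192.168.122.101. If local delivery really depends on 'lo' as the ping output suggests, the call would be expected to time out and return False after "ifdown lo".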


On 2014/12/12 13:54, Ralph Castain wrote:
> I honestly think it has to be a selected interface, Gilles, else we will fail 
> to connect.
>
>> On Dec 11, 2014, at 8:26 PM, Gilles Gouaillardet wrote:
>>
>> Paul,
>>
>> about the five warnings :
>> can you confirm you are running mpirun *not* on n15 nor n16 ?
>> if my guess is correct, then you can get up to 5 warnings : mpirun + 2 orted 
>> + 2 mpi tasks
>>
>> do you have any oob_tcp_if_include or oob_tcp_if_exclude settings in your 
>> openmpi-mca-params.conf ?
>>
>> here is attached a patch to fix this issue.
>> what we really want is to test that there is a loopback interface, period.
>> the current code (my bad for not having reviewed in a timely manner) seems
>> to check that there is a *selected* loopback interface.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/12 13:15, Paul Hargrove wrote:
>>> Ralph,
>>>
>>> Sorry to be the bearer of more bad news.
>>> The "good" news is I've seen the new warning regarding the lack of a
>>> loopback interface.
>>> The BAD news is that it is occurring on a Linux cluster that I've verified
>>> DOES have 'lo' configured on the front-end and compute nodes (UP and
>>> RUNNING according to ifconfig).
>>>
>>> Though run with "-np 2" the warning appears FIVE times.
>>> ADDITIONALLY, there is a SEGV at exit!
>>>
>>> Unfortunately, despite configuring with --enable-debug, I didn't get line
>>> numbers from the core (and there was no backtrace printed).
>>>
>>> All of this appears below (and no, "-mca mtl psm" is not a typo or a joke).
>>>
>>> Let me know what tracing flags to apply to gather the info needed to debug
>>> this.
>>>
>>> -Paul
>>>
>>>
>>> $ mpirun -mca btl sm,self -np 2 -host n15,n16 -mca mtl psm examples/ring_c
>>> --
>>> WARNING: No loopback interface was found. This can cause problems
>>> when we spawn processes as they are likely to be unable to connect
>>> back to their host daemon. Sadly, it may take awhile for the connect
>>> attempt to fail, so you may experience a significant hang time.
>>>
>>> You may wish to ctrl-c out of your job and activate loopback support
>>> on at least one interface before trying again.
>>>
>>> --
>>> [... above message FOUR more times ...]
>>> Process 1 exiting
>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>> Process 0 sent to 1
>>> Process 0 decremented value: 9
>>> Process 0 decremented value: 8
>>> Process 0 decremented value: 7
>>> Process 0 decremented value: 6
>>> Process 0 decremented value: 5
>>> Process 0 decremented value: 4
>>> Process 0 decremented value: 3
>>> Process 0 decremented value: 2
>>> Process 0 decremented value: 1
>>> Process 0 decremented value: 0
>>> Process 0 exiting
>>> --
>>> mpirun noticed that process rank 0 with PID 0 on node n15 exited on signal
>>> 11 (Segmentation fault).
>>> --
>>>
>>> $ /sbin/ifconfig lo
>>> lo  Link encap:Local Loopback
>>>   inet addr:127.0.0.1  Mask:255.0.0.0
>>>   inet6 addr: ::1/128 Scope:Host
>>>   UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>   RX packets:481228 errors:0 dropped:0 overruns:0 frame:0
>>>   TX packets:481228 errors:0 dropped:0 overruns:0 carrier:0
>>>   collisions:0 txqueuelen:0
>>>   RX bytes:81039065 (77.2 MiB)  TX bytes:81039065 (77.2 MiB)
>>>
>>> $ ssh n15 /sbin/ifconfig lo
>>> lo  Link encap:Local Loopback
>>>   inet addr:127.0.0.1  Mask:255.0.0.0
>>>   inet6 addr: ::1/128 Scope:Host
>>>   UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>   RX packets:24885 errors:0 dropped:0 overruns:0 frame:0
>>>   TX packets:24885 errors:0 dropp
