Re: [OMPI devel] v1.8.2 still held up...

2014-08-08 Thread Paul Hargrove
On Thu, Aug 7, 2014 at 10:55 AM, Ralph Castain  wrote:

> * fixes to coll/ml that expanded to fixing page alignment in general -
> someone needs to review/approve it:
> https://svn.open-mpi.org/trac/ompi/ticket/4826
>

I've been able to confirm that the nightly tarball (1.8.2rc4r32480) works
as expected on the SPARC and PPC64 platforms where I had reproduced the
problem previously.  I won't have access to the IA64 platform (which also
has pagesize != 4K) until about 6 hours from now, but have no doubt the fix
will work there too.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] v1.8.2 still held up...

2014-08-08 Thread Paul Hargrove
On Thu, Aug 7, 2014 at 10:55 AM, Ralph Castain  wrote:

> * static linking failure - Gilles has posted a proposed fix, but somebody
> needs to approve and CMR it. Please see:
> https://svn.open-mpi.org/trac/ompi/ticket/4834
>


Jeff moved the fix to v1.8 in r32471.
I have tested tonight's tarball (1.8.2rc4r32480) and found the problem to
be resolved on all tested OSes (linux, macos, freebsd, netbsd, openbsd,
solaris-10 and solaris-11).

-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value

2014-08-08 Thread Hjelm, Nathan Thomas
I will try to take a look this week and see what I can do.

-Nathan

From: devel [devel-boun...@open-mpi.org] on behalf of George Bosilca 
[bosi...@icl.utk.edu]
Sent: Thursday, August 07, 2014 10:37 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old 
value

Paul's tests identified a small issue with the previous patch (a real 
corner case for ARM v5). The patch below fixes all known issues.

Btw, there is still room for volunteers for the .asm work.

  George.



On Tue, Aug 5, 2014 at 2:23 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
Thanks to Paul's help, all the inlined atomics have been tested. The new patch is 
attached below. However, this only fixes the inline atomics; all those 
generated from the *.asm files have not been updated. Any volunteer?

  George.



On Aug 1, 2014, at 18:09, Paul Hargrove <phhargr...@lbl.gov> wrote:

I have confirmed that George's latest version works on both SPARC ABIs.

ARMv7 and three MIPS ABIs still pending...

-Paul


On Fri, Aug 1, 2014 at 9:40 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
Another version of the atomic patch. Paul has tested it on a bunch of 
platforms. At this point we have confirmation from all architectures except 
SPARC (v8+ and v9).

  George.



On Jul 31, 2014, at 19:13, George Bosilca <bosi...@icl.utk.edu> wrote:

> All,
>
> Here is the patch that changes the meaning of the atomics to make them always 
> return the previous value (similar to sync_fetch_and_<*>). I tested this with 
> the following atomics: OS X, gcc style intrinsics and AMD64.
>
> I did not change the base assembly files used when GCC style assembly 
> operations are not supported. If someone feels like fixing them, feel free.
>
> Paul, I know you have a pretty diverse range of computers. Can you try to 
> compile and run a “make check” with the following patch?
>
>  George.
>
> 
>
> On Jul 30, 2014, at 15:21, Nathan Hjelm <hje...@lanl.gov> wrote:
>
>>
>> That is what I would prefer. I was trying to not disturb things too
>> much :). Please bring the changes over!
>>
>> -Nathan
>>
>> On Wed, Jul 30, 2014 at 03:18:44PM -0400, George Bosilca wrote:
>>>  Why do you want to add new versions? This will lead to having two, almost
>>>  identical, sets of atomics that are conceptually equivalent but different
>>>  in terms of code. And we will have to maintain both!
>>>  I did a similar change in a fork of OPAL in another project but instead of
>>>  adding another flavor of atomics, I completely replaced the available ones
>>>  with a set returning the old value. I can bring the code over.
>>>George.
>>>
>>>  On Tue, Jul 29, 2014 at 5:29 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>>On Tue, Jul 29, 2014 at 2:10 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>>>
>>>  Is there a reason why the current implementations of opal atomics
>>>  (add, cmpset) do not return the old value?
>>>
>>>Because some CPUs don't implement such an atomic instruction?
>>>
>>>On any CPU one *can* certainly synthesize the desired operation with an
>>>added read before the compare-and-swap to return a value that was
>>>present at some time before a failed cmpset.  That is almost certainly
>>>sufficient for your purposes.  However, the added load makes it
>>>(marginally) more expensive on some CPUs that only have the native
>>>equivalent of gcc's __sync_bool_compare_and_swap().
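
A minimal sketch of the synthesis Paul describes, assuming only gcc's
boolean intrinsic is available (illustrative code, not the actual Open MPI
patch):

    #include <stdint.h>

    /* Build a value-returning compare-and-swap from a boolean one.  The
     * extra load is the (marginal) cost mentioned above.  Caveat: on a
     * failed CAS the returned value was only current at some time before
     * the attempt, which is usually -- but not universally -- enough. */
    static inline int32_t
    cswap32_with_old(volatile int32_t *addr, int32_t oldval, int32_t newval)
    {
        int32_t prev = *addr;   /* added read before the compare-and-swap */
        if (__sync_bool_compare_and_swap(addr, oldval, newval)) {
            return oldval;      /* success: the old value was oldval */
        }
        return prev;            /* failure: a value seen before the cmpset */
    }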



Re: [OMPI devel] ORTE headers in OPAL source

2014-08-08 Thread George Bosilca
These are harmless. They are only used when FT is enabled, which should
rarely be the case.

  George.



On Fri, Aug 8, 2014 at 4:36 PM, Jeff Squyres (jsquyres) 
wrote:

> Here are a few ORTE headers in OPAL source -- can the respective owners clean
> these up?  Thanks.
>
> -
> mca/btl/smcuda/btl_smcuda.c
> 63:#include "orte/mca/sstore/sstore.h"
>
> mca/btl/sm/btl_sm.c
> 62:#include "orte/mca/sstore/sstore.h"
>
> mca/mpool/sm/mpool_sm_module.c
> 34:#include "orte/mca/sstore/sstore.h"
> -
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15570.php
>


[OMPI devel] ORTE headers in OPAL source

2014-08-08 Thread Jeff Squyres (jsquyres)
Here are a few ORTE headers in OPAL source -- can the respective owners clean these 
up?  Thanks.

-
mca/btl/smcuda/btl_smcuda.c
63:#include "orte/mca/sstore/sstore.h"

mca/btl/sm/btl_sm.c
62:#include "orte/mca/sstore/sstore.h"

mca/mpool/sm/mpool_sm_module.c
34:#include "orte/mca/sstore/sstore.h"
-

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] ompi headers in OPAL source

2014-08-08 Thread Jeff Squyres (jsquyres)
I found a few more OMPI header files included in OPAL source code.  Can the 
respective owners clean this stuff up?

Thanks!

-
mca/btl/openib/btl_openib_component.c
87:#include "ompi/mca/rte/rte.h"

mca/btl/ugni/btl_ugni_component.c
20:#include "ompi/runtime/params.h"

mca/btl/ugni/btl_ugni_add_procs.c
20:#include "ompi/communicator/communicator.h"

mca/btl/usnic/btl_usnic_hwloc.c
33:#include "ompi/mca/rte/rte.h"

mca/btl/usnic/btl_usnic_compat.h
43:#  include "ompi/mca/rte/rte.h"

mca/common/ofacm/common_ofacm_xoob.c
24:#include "ompi/mca/rte/rte.h"

mca/common/ofacm/common_ofacm_oob.c
35:#include "ompi/mca/rte/rte.h"

mca/mpool/base/mpool_base_alloc.c
32:#include "ompi/info/info.h" /* TODO */

mca/mpool/sm/mpool_sm_module.c
36:#include "ompi/runtime/ompi_cr.h" /* TODO */
-

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Ralph Castain
Yes, I know - but the problem comes from nidmap pushing data down into the 
opal_db/dstore level, which then creates a copy of the data. That's where the 
alignment error is generated.


On Aug 8, 2014, at 11:17 AM, George Bosilca  wrote:

> On Fri, Aug 8, 2014 at 5:21 AM, Ralph Castain  wrote:
> Sorry to chime in a little late. George is likely correct about using 
> ORTE_NAME, only you can't do that as the OPAL layer has no idea what that 
> datatype looks like. This was the original reason for creating the 
> opal_identifier_t type - I had no other choice when we moved the db framework 
> (now dstore) to the OPAL layer in anticipation of the BTLs moving to OPAL. 
> The abstraction requirement wouldn't allow me to pass down the structure 
> definition.
> 
> We are talking about nidmap.c which has not yet been moved down to OPAL. 
> 
>   George.
>  
> 
> The easiest solution is probably to change the opal/db/hash code so that 
> 64-bit fields are memcpy'd instead of simply passed by "=". This should 
> eliminate the problem with the least fuss.
> 
> There is a performance penalty for using non-aligned data, and ideally we 
> should use aligned data whenever possible. This code isn't in the critical 
> path and so this is less of an issue, but still would be nice to do. However, 
> I didn't do so for the following reasons:
> 
> * I couldn't find a way for the compiler to check/require alignment down in 
> opal_db.store when passed a parameter. If someone knows of a way to do that, 
> please feel free to suggest it
> 
> * none of our current developers have access to a Solaris SPARC machine, and 
> thus our developers cannot detect violations when they occur
> 
> * the current solution avoids the issue, albeit with a slight performance 
> penalty
> 
> I'm open to alternative methods - I'm not happy with the ugliness this 
> required, but couldn't come up with a cleaner solution that would be easy for 
> developers to know when they violated the alignment requirement.
> 
> FWIW: it is possible, I suppose, that the other discussion about using an 
> opal_process_name_t that exactly mirrors orte_process_name_t could also 
> resolve this problem in a cleaner fashion. I didn't impose that requirement 
> here, but maybe it's another motivator for doing so?
> 
> Ralph
> 
> 
> On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet 
>  wrote:
> 
>> George,
>> 
>> (one of the) faulty line was :
>> 
>>if (ORTE_SUCCESS != (rc = 
>> opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL,
>> 
>> OPAL_DB_LOCALLDR, (opal_identifier_t*)&proc, OPAL_ID_T))) {
>> 
>> so if proc is not 64 bits aligned, a SIGBUS will occur on sparc.
>> as you pointed out, replacing OPAL_ID_T with ORTE_NAME will very likely fix the 
>> issue (i have no arch to test...)
>> 
>> i was initially also "confused" with the following line
>> 
>> if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, 
>> OPAL_SCOPE_INTERNAL,
>> ORTE_DB_NPROC_OFFSET, 
>> &offset, OPAL_UINT32))) {
>> 
>> the first argument of store is an (opal_identifier_t *)
>> strictly speaking this is "a pointer to a 64 bits aligned address", and proc 
>> might not be 64 bits aligned.
>> /* that being said, there is no crash :-) */
>> 
>> in this case, opal_db.store pointer points to the store function 
>> (db_hash.c:178)
>> and proc is only used in a memcpy at line 194, so 64 bits alignment is not 
>> required.
>> (and comment is explicit : /* to protect alignment, copy the data across */
>> 
>> that might sound pedantic, but are we doing the right thing here?
>> (e.g. cast to (opal_identifier_t *), followed by a memcpy  in case the 
>> pointer was not 64 bits aligned
>> vs always use aligned data ?)
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/08/08 14:58, George Bosilca wrote:
>>> This is a gigantic patch for an almost trivial issue. The current problem
>>> is purely related to the fact that in a single location (nidmap.c) the
>>> orte_process_name_t (which is a structure of 2 integers) is supposed to be
>>> aligned based on the uint64_t requirements. Bad assumption!
>>> 
>>> Looking at the code one might notice that the orte_process_name_t is stored
>>> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold
>>> on the SPARC architecture because the two types (int32_t and int64_t) have
>>>  different alignments.  However, ORTE defines a type for orte_process_name_t.
>>> Thus, I think that if instead of saving the orte_process_name_t as an
>>> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
>>> 
>>>   George.
>>> 
>>> 
>>> 
>>> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
>>> gilles.gouaillar...@iferc.org> wrote:
>>> 
 Kawashima-san and all,
 
 Here is attached a one-off patch for v1.8.
 /* it does not use the __attribute__ modifier that might not be
 supported by all compilers */

Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC

2014-08-08 Thread Paul Hargrove
I will attempt to confirm on my Solaris-10 system ASAP.
That will allow me to finally be certain that the other static linking
issue has been resolved.

-Paul


On Fri, Aug 8, 2014 at 11:39 AM, Jeff Squyres (jsquyres)  wrote:

> Thanks!
>
> On Aug 8, 2014, at 2:30 PM, George Bosilca  wrote:
>
> > r32467 should fix the problem.
> >
> >   George.
> >
> >
> > On Fri, Aug 8, 2014 at 1:20 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > That'll do it...
> >
> > George: can you fix?
> >
> >
> > On Aug 8, 2014, at 1:11 PM, Ralph Castain  wrote:
> >
> > > I think it might be getting pulled in from this include:
> > >
> > > opal/mca/common/sm/common_sm.h:37:#include "ompi/group/group.h"
> > >
> > >
> > > On Aug 8, 2014, at 5:33 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > >
> > >> Weirdness; I don't see any name like that in the SM BTL.
> > >>
> > >> I see it used in the OMPI layer... not sure how it's being used down
> in the btl SM component file...?
> > >>
> > >>
> > >> On Aug 7, 2014, at 11:25 PM, Paul Hargrove 
> wrote:
> > >>
> > >>> Testing r32448 on trunk for trac issue #4834, I encounter the
> following which appears unrelated to #4834:
> > >>>
> > >>>  CCLD orte-info
> > >>> Undefined   first referenced
> > >>> symbol in file
> > >>> ompi_proc_local_proc
>  
> /sandbox/hargrove/OMPI/openmpi-trunk-solaris10-sparcT2-ss12u3-v9-static/BLD/opal/.libs/libopen-pal.a(libmca_btl_sm_la-btl_sm_component.o)
> > >>> ld: fatal: Symbol referencing errors. No output written to orte-info
> > >>>
> > >>> Note that this is *static* linking.
> > >>>
> > >>> This appears to indicate a call from OPAL to OMPI, and I am guessing
> this is a side-effect of the BTL move.
> > >>>
> > >>> Since OMPI contains (many) calls to OPAL this is a circular library
> dependence.
> > >>> Unfortunately, some linkers process their arguments strictly
> left-to-right.
> > >>> Thus if this dependence is not eliminated one may need "-lmpi
> -lopen-pal -lmpi" (or similar) to resolve it.
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15565.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15566.php
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC

2014-08-08 Thread Jeff Squyres (jsquyres)
Thanks!

On Aug 8, 2014, at 2:30 PM, George Bosilca  wrote:

> r32467 should fix the problem.
> 
>   George.
> 
> 
> On Fri, Aug 8, 2014 at 1:20 PM, Jeff Squyres (jsquyres)  
> wrote:
> That'll do it...
> 
> George: can you fix?
> 
> 
> On Aug 8, 2014, at 1:11 PM, Ralph Castain  wrote:
> 
> > I think it might be getting pulled in from this include:
> >
> > opal/mca/common/sm/common_sm.h:37:#include "ompi/group/group.h"
> >
> >
> > On Aug 8, 2014, at 5:33 AM, Jeff Squyres (jsquyres)  
> > wrote:
> >
> >> Weirdness; I don't see any name like that in the SM BTL.
> >>
> >> I see it used in the OMPI layer... not sure how it's being used down in 
> >> the btl SM component file...?
> >>
> >>
> >> On Aug 7, 2014, at 11:25 PM, Paul Hargrove  wrote:
> >>
> >>> Testing r32448 on trunk for trac issue #4834, I encounter the following 
> >>> which appears unrelated to #4834:
> >>>
> >>>  CCLD orte-info
> >>> Undefined   first referenced
> >>> symbol in file
> >>> ompi_proc_local_proc
> >>> /sandbox/hargrove/OMPI/openmpi-trunk-solaris10-sparcT2-ss12u3-v9-static/BLD/opal/.libs/libopen-pal.a(libmca_btl_sm_la-btl_sm_component.o)
> >>> ld: fatal: Symbol referencing errors. No output written to orte-info
> >>>
> >>> Note that this is *static* linking.
> >>>
> >>> This appears to indicate a call from OPAL to OMPI, and I am guessing this 
> >>> is a side-effect of the BTL move.
> >>>
> >>> Since OMPI contains (many) calls to OPAL this is a circular library 
> >>> dependence.
> >>> Unfortunately, some linkers process their arguments strictly left-to-right.
> >>> Thus if this dependence is not eliminated one may need "-lmpi -lopen-pal 
> >>> -lmpi" (or similar) to resolve it.
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15565.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC

2014-08-08 Thread George Bosilca
r32467 should fix the problem.

  George.


On Fri, Aug 8, 2014 at 1:20 PM, Jeff Squyres (jsquyres) 
wrote:

> That'll do it...
>
> George: can you fix?
>
>
> On Aug 8, 2014, at 1:11 PM, Ralph Castain  wrote:
>
> > I think it might be getting pulled in from this include:
> >
> > opal/mca/common/sm/common_sm.h:37:#include "ompi/group/group.h"
> >
> >
> > On Aug 8, 2014, at 5:33 AM, Jeff Squyres (jsquyres) 
> wrote:
> >
> >> Weirdness; I don't see any name like that in the SM BTL.
> >>
> >> I see it used in the OMPI layer... not sure how it's being used down
> in the btl SM component file...?
> >>
> >>
> >> On Aug 7, 2014, at 11:25 PM, Paul Hargrove  wrote:
> >>
> >>> Testing r32448 on trunk for trac issue #4834, I encounter the
> following which appears unrelated to #4834:
> >>>
> >>>  CCLD orte-info
> >>> Undefined   first referenced
> >>> symbol in file
> >>> ompi_proc_local_proc
>  
> /sandbox/hargrove/OMPI/openmpi-trunk-solaris10-sparcT2-ss12u3-v9-static/BLD/opal/.libs/libopen-pal.a(libmca_btl_sm_la-btl_sm_component.o)
> >>> ld: fatal: Symbol referencing errors. No output written to orte-info
> >>>
> >>> Note that this is *static* linking.
> >>>
> >>> This appears to indicate a call from OPAL to OMPI, and I am guessing
> this is a side-effect of the BTL move.
> >>>
> >>> Since OMPI contains (many) calls to OPAL this is a circular library
> dependence.
> >>> Unfortunately, some linkers process their arguments strictly
> left-to-right.
> >>> Thus if this dependence is not eliminated one may need "-lmpi
> -lopen-pal -lmpi" (or similar) to resolve it.
>


Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread George Bosilca
On Fri, Aug 8, 2014 at 5:21 AM, Ralph Castain  wrote:

> Sorry to chime in a little late. George is likely correct about using
> ORTE_NAME, only you can't do that as the OPAL layer has no idea what that
> datatype looks like. This was the original reason for creating the
> opal_identifier_t type - I had no other choice when we moved the db
> framework (now dstore) to the OPAL layer in anticipation of the BTLs moving
> to OPAL. The abstraction requirement wouldn't allow me to pass down the
> structure definition.
>

We are talking about nidmap.c, which has not yet been moved down to OPAL.

  George.


>
> The easiest solution is probably to change the opal/db/hash code so that
> 64-bit fields are memcpy'd instead of simply passed by "=". This should
> eliminate the problem with the least fuss.
>
> There is a performance penalty for using non-aligned data, and ideally we
> should use aligned data whenever possible. This code isn't in the critical
> path and so this is less of an issue, but still would be nice to do.
> However, I didn't do so for the following reasons:
>
> * I couldn't find a way for the compiler to check/require alignment down
> in opal_db.store when passed a parameter. If someone knows of a way to do
> that, please feel free to suggest it
>
> * none of our current developers have access to a Solaris SPARC machine,
> and thus our developers cannot detect violations when they occur
>
> * the current solution avoids the issue, albeit with a slight performance
> penalty
>
> I'm open to alternative methods - I'm not happy with the ugliness this
> required, but couldn't come up with a cleaner solution that would be easy
> for developers to know when they violated the alignment requirement.
>
> FWIW: it is possible, I suppose, that the other discussion about using an
> opal_process_name_t that exactly mirrors orte_process_name_t could also
> resolve this problem in a cleaner fashion. I didn't impose that requirement
> here, but maybe it's another motivator for doing so?
>
> Ralph
>
>
> On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>  George,
>
> (one of the) faulty line was :
>
>if (ORTE_SUCCESS != (rc =
> opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL,
>
> OPAL_DB_LOCALLDR, (opal_identifier_t*)&proc, OPAL_ID_T))) {
>
> so if proc is not 64 bits aligned, a SIGBUS will occur on sparc.
> as you pointed out, replacing OPAL_ID_T with ORTE_NAME will very likely fix
> the issue (i have no arch to test...)
>
> i was initially also "confused" with the following line
>
> if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc,
> OPAL_SCOPE_INTERNAL,
> ORTE_DB_NPROC_OFFSET,
> &offset, OPAL_UINT32))) {
>
> the first argument of store is an (opal_identifier_t *)
> strictly speaking this is "a pointer to a 64 bits aligned address", and
> proc might not be 64 bits aligned.
> /* that being said, there is no crash :-) */
>
> in this case, opal_db.store pointer points to the store function
> (db_hash.c:178)
> and proc is only used in a memcpy at line 194, so 64 bits alignment is not
> required.
> (and comment is explicit : /* to protect alignment, copy the data across
> */
>
> that might sound pedantic, but are we doing the right thing here?
> (e.g. cast to (opal_identifier_t *), followed by a memcpy  in case the
> pointer was not 64 bits aligned
> vs always use aligned data ?)
>
> Cheers,
>
> Gilles
>
> On 2014/08/08 14:58, George Bosilca wrote:
>
> This is a gigantic patch for an almost trivial issue. The current problem
> is purely related to the fact that in a single location (nidmap.c) the
> orte_process_name_t (which is a structure of 2 integers) is supposed to be
> aligned based on the uint64_t requirements. Bad assumption!
>
> Looking at the code one might notice that the orte_process_name_t is stored
> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold
> on the SPARC architecture because the two types (int32_t and int64_t) have
> different alignments.  However, ORTE defines a type for orte_process_name_t.
> Thus, I think that if instead of saving the orte_process_name_t as an
> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
>
>   George.
>
>
>
> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet 
>  wrote:
>
>
>  Kawashima-san and all,
>
> Here is attached a one off patch for v1.8.
> /* it does not use the __attribute__ modifier that might not be
> supported by all compilers */
>
> as far as i am concerned, the same issue is also in the trunk,
> and if you do not hit it, it just means you are lucky :-)
>
> the same issue might also be in other parts of the code :-(
>
> Cheers,
>
> Gilles
>
> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
>
>  Gilles, George,
>
 The problem is the one Gilles pointed out.
 I temporarily modified the code below and the bus error disappeared.
>
 --- orte/util/nidmap.c  (revision 32447)

Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC

2014-08-08 Thread Jeff Squyres (jsquyres)
That'll do it...

George: can you fix?


On Aug 8, 2014, at 1:11 PM, Ralph Castain  wrote:

> I think it might be getting pulled in from this include:
> 
> opal/mca/common/sm/common_sm.h:37:#include "ompi/group/group.h"
> 
> 
> On Aug 8, 2014, at 5:33 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
>> Weirdness; I don't see any name like that in the SM BTL.
>> 
>> I see it used in the OMPI layer... not sure how it's being used down in the 
>> btl SM component file...?
>> 
>> 
>> On Aug 7, 2014, at 11:25 PM, Paul Hargrove  wrote:
>> 
>>> Testing r32448 on trunk for trac issue #4834, I encounter the following 
>>> which appears unrelated to #4834:
>>> 
>>>  CCLD orte-info
>>> Undefined   first referenced
>>> symbol in file
>>> ompi_proc_local_proc
>>> /sandbox/hargrove/OMPI/openmpi-trunk-solaris10-sparcT2-ss12u3-v9-static/BLD/opal/.libs/libopen-pal.a(libmca_btl_sm_la-btl_sm_component.o)
>>> ld: fatal: Symbol referencing errors. No output written to orte-info
>>> 
>>> Note that this is *static* linking.
>>> 
>>> This appears to indicate a call from OPAL to OMPI, and I am guessing this 
>>> is a side-effect of the BTL move.
>>> 
>>> Since OMPI contains (many) calls to OPAL this is a circular library 
>>> dependence.
>>> Unfortunately, some linkers process their arguments strictly left-to-right.
>>> Thus if this dependence is not eliminated one may need "-lmpi -lopen-pal 
>>> -lmpi" (or similar) to resolve it.
>>> 
>>> -Paul
>>> 
>>> -- 
>>> Paul H. Hargrove  phhargr...@lbl.gov
>>> Future Technologies Group
>>> Computer and Data Sciences Department Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2014/08/15540.php
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/08/15553.php
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15562.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC

2014-08-08 Thread Ralph Castain
I think it might be getting pulled in from this include:

opal/mca/common/sm/common_sm.h:37:#include "ompi/group/group.h"


On Aug 8, 2014, at 5:33 AM, Jeff Squyres (jsquyres)  wrote:

> Weirdness; I don't see any name like that in the SM BTL.
> 
> I see it used in the OMPI layer... not sure how it's being used down in the 
> btl SM component file...?
> 
> 
> On Aug 7, 2014, at 11:25 PM, Paul Hargrove  wrote:
> 
>> Testing r32448 on trunk for trac issue #4834, I encounter the following 
>> which appears unrelated to #4834:
>> 
>>  CCLD orte-info
>> Undefined   first referenced
>> symbol in file
>> ompi_proc_local_proc
>> /sandbox/hargrove/OMPI/openmpi-trunk-solaris10-sparcT2-ss12u3-v9-static/BLD/opal/.libs/libopen-pal.a(libmca_btl_sm_la-btl_sm_component.o)
>> ld: fatal: Symbol referencing errors. No output written to orte-info
>> 
>> Note that this is *static* linking.
>> 
>> This appears to indicate a call from OPAL to OMPI, and I am guessing this is 
>> a side-effect of the BTL move.
>> 
>> Since OMPI contains (many) calls to OPAL this is a circular library 
>> dependence.
>> Unfortunately, some linkers process their arguments strictly left-to-right.
>> Thus if this dependence is not eliminated one may need "-lmpi -lopen-pal 
>> -lmpi" (or similar) to resolve it.
>> 
>> -Paul
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/08/15540.php
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15553.php



Re: [OMPI devel] Open MPI SVN -> Git (github) conversion

2014-08-08 Thread Jeff Squyres (jsquyres)
Done; thanks.

On Aug 8, 2014, at 11:05 AM, Tim Mattox  wrote:

> Jeff,
> I may someday again be working for an organization that is an Open MPI 
> contributor... so could you
> update my e-mail address in the authors.txt file to be "timattox = Tim Mattox 
> "
> Thanks!
> 
> 
> On Fri, Aug 8, 2014 at 11:00 AM, Jeff Squyres (jsquyres)  
> wrote:
> SHORT VERSION
> =
> 
> Please verify/update the email address that you'd like me to use for your 
> Open MPI commits when we do the git conversion:
> 
> https://github.com/open-mpi/authors
> 
> Updates are due by COB Friday, 15 Aug, 2014 (1 week from today).
> 
> MORE DETAIL
> ===
> 
> Dave and I are continuing to work on the logistics of the SVN -> Git 
> conversion.
> 
> As part of the process, I need email addresses for which you'd like your 
> commits to appear in the git repo.  Please see this git repo for the current 
> list of email addresses that I have, as well as instructions for how to 
> update them:
> 
> https://github.com/open-mpi/authors
> 
> Thanks!
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/1.php
> 
> 
> 
> -- 
> Tim Mattox, Ph.D. - tmat...@gmail.com
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15556.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] ibm abort test hangs on one node

2014-08-08 Thread Ralph Castain
Committed a fix for this in r32460 - see if I got it!

On Aug 8, 2014, at 4:02 AM, Gilles Gouaillardet  
wrote:

> Folks,
> 
> here is the description of a hang i briefly mentioned a few days ago.
> 
> with the trunk (i did not check 1.8 ...) simply run on one node :
> mpirun -np 2 --mca btl sm,self ./abort
> 
> (the abort test is taken from the ibm test suite: process 0 calls
> MPI_Abort while process 1 enters an infinite loop)
> 
> there is a race condition : sometimes it hangs, sometimes it aborts
> nicely as expected.
> when the hang occurs, both abort processes have exited and mpirun waits
> forever
> 
> i made some investigations and i have now a better idea of what happens
> (but i am still clueless on how to fix this)
> 
> when process 0 aborts, it:
> - closes the tcp socket connected to mpirun
> - closes the pipe connected to mpirun
> - sends SIGCHLD to mpirun
> 
> then on mpirun :
> when SIGCHLD is received, the handler basically writes 17 (the signal
> number) to a socketpair.
> then libevent will return from a poll and here is the race condition,
> basically :
> if revents is non zero for the three fds (socket, pipe and socketpair)
> then the program will abort nicely
> if revents is non zero for both socket and pipe but is zero for the
> socketpair, then the mpirun will hang
> 
> i dug a bit deeper and found that when the event on the socketpair is
> processed, it will end up calling
> odls_base_default_wait_local_proc.
> if proc->state is 5 (aka ORTE_PROC_STATE_REGISTERED), then the program
> will abort nicely
> *but* if proc->state is 6 (aka ORTE_PROC_STATE_IOF_COMPLETE), then the
> program will hang
> 
> another way to put this is that
> when the program aborts nicely, the call sequence is
> odls_base_default_wait_local_proc
> proc_errors(vpid=0)
> proc_errors(vpid=0)
> proc_errors(vpid=1)
> proc_errors(vpid=1)
> 
> when the program hangs, the call sequence is
> proc_errors(vpid=0)
> odls_base_default_wait_local_proc
> proc_errors(vpid=0)
> proc_errors(vpid=1)
> proc_errors(vpid=1)
> 
> i will resume this on Monday unless someone can fix this in the meantime :-)
> 
> Cheers,
> 
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15552.php



Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Ralph Castain
Committed a fix for this in r32459 - please check and see if this resolves the 
issue.


On Aug 8, 2014, at 2:21 AM, Ralph Castain  wrote:

> Sorry to chime in a little late. George is likely correct about using 
> ORTE_NAME, only you can't do that as the OPAL layer has no idea what that 
> datatype looks like. This was the original reason for creating the 
> opal_identifier_t type - I had no other choice when we moved the db framework 
> (now dstore) to the OPAL layer in anticipation of the BTLs moving to OPAL. 
> The abstraction requirement wouldn't allow me to pass down the structure 
> definition.
> 
> The easiest solution is probably to change the opal/db/hash code so that 
> 64-bit fields are memcpy'd instead of simply passed by "=". This should 
> eliminate the problem with the least fuss.
> 
> There is a performance penalty for using non-aligned data, and ideally we 
> should use aligned data whenever possible. This code isn't in the critical 
> path and so this is less of an issue, but still would be nice to do. However, 
> I didn't do so for the following reasons:
> 
> * I couldn't find a way for the compiler to check/require alignment down in 
> opal_db.store when passed a parameter. If someone knows of a way to do that, 
> please feel free to suggest it
> 
> * none of our current developers have access to a Solaris SPARC machine, and 
> thus our developers cannot detect violations when they occur
> 
> * the current solution avoids the issue, albeit with a slight performance 
> penalty
> 
> I'm open to alternative methods - I'm not happy with the ugliness this 
> required, but couldn't come up with a cleaner solution that would be easy for 
> developers to know when they violated the alignment requirement.
> 
> FWIW: it is possible, I suppose, that the other discussion about using an 
> opal_process_name_t that exactly mirrors orte_process_name_t could also 
> resolve this problem in a cleaner fashion. I didn't impose that requirement 
> here, but maybe it's another motivator for doing so?
> 
> Ralph
> 
> 
> On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet 
>  wrote:
> 
>> George,
>> 
>> (one of the) faulty line was :
>> 
>>if (ORTE_SUCCESS != (rc = 
>> opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL,
>> 
>> OPAL_DB_LOCALLDR, (opal_identifier_t*)&proc, OPAL_ID_T))) {
>> 
>> so if proc is not 64 bits aligned, a SIGBUS will occur on sparc.
>> as you pointed out, replacing OPAL_ID_T with ORTE_NAME will very likely fix the 
>> issue (i have no arch to test...)
>> 
>> i was initially also "confused" with the following line
>> 
>> if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, 
>> OPAL_SCOPE_INTERNAL,
>> ORTE_DB_NPROC_OFFSET, 
>> &offset, OPAL_UINT32))) {
>> 
>> the first argument of store is an (opal_identifier_t *)
>> strictly speaking this is "a pointer to a 64 bits aligned address", and proc 
>> might not be 64 bits aligned.
>> /* that being said, there is no crash :-) */
>> 
>> in this case, opal_db.store pointer points to the store function 
>> (db_hash.c:178)
>> and proc is only used in a memcpy at line 194, so 64 bits alignment is not 
>> required.
>> (and comment is explicit : /* to protect alignment, copy the data across */
>> 
>> that might sound pedantic, but are we doing the right thing here?
>> (e.g. cast to (opal_identifier_t *), followed by a memcpy  in case the 
>> pointer was not 64 bits aligned
>> vs always use aligned data ?)
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/08/08 14:58, George Bosilca wrote:
>>> This is a gigantic patch for an almost trivial issue. The current problem
>>> is purely related to the fact that in a single location (nidmap.c) the
>>> orte_process_name_t (which is a structure of 2 integers) is supposed to be
>>> aligned based on the uint64_t requirements. Bad assumption!
>>> 
>>> Looking at the code one might notice that the orte_process_name_t is stored
>>> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold
>>> on the SPARC architecture because the two types (int32_t and int64_t) have
>>> different alignments.  However, ORTE defines a type for orte_process_name_t.
>>> Thus, I think that if instead of saving the orte_process_name_t as an
>>> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
>>> 
>>>   George.
>>> 
>>> 
>>> 
>>> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
>>> gilles.gouaillar...@iferc.org> wrote:
>>> 
 Kawashima-san and all,
 
 Here is attached a one off patch for v1.8.
 /* it does not use the __attribute__ modifier that might not be
 supported by all compilers */
 
 as far as i am concerned, the same issue is also in the trunk,
 and if you do not hit it, it just means you are lucky :-)
 
 the same issue might also be in other parts of the code :-(
 
 Cheers,
 
 Gilles

Re: [OMPI devel] jenkins error in trunk

2014-08-08 Thread Ralph Castain
Fixed in r32462

On Aug 8, 2014, at 8:13 AM, Mike Dubman  wrote:

> 
> Josh,Devendar - could you please take a look?
> Thanks
> 
> 15:45:00 Making install in mca/coll/fca
> 15:45:00 make[2]: Entering directory 
> `/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi/mca/coll/fca'
> 15:45:00   CC   coll_fca_module.lo
> 15:45:00 coll_fca_module.c: In function 'have_remote_peers':
> 15:45:00 coll_fca_module.c:48: error: 'ompi_proc_t' has no member named 
> 'proc_flags'
> 15:45:00 coll_fca_module.c:48: error: 'ompi_proc_t' has no member named 
> 'proc_flags'
> 15:45:00 coll_fca_module.c: In function '__get_local_ranks':
> 15:45:00 coll_fca_module.c:75: error: 'ompi_proc_t' has no member named 
> 'proc_flags'
> 15:45:00 coll_fca_module.c:75: error: 'ompi_proc_t' has no member named 
> 'proc_flags'
> 15:45:00 coll_fca_module.c:95: error: 'ompi_proc_t' has no member named 
> 'proc_flags'
> 15:45:00 coll_fca_module.c:95: error: 'ompi_proc_t' has no member named 
> 'proc_flags'
> 15:45:00 make[2]: *** [coll_fca_module.lo] Error 1
> 15:45:00 make[2]: Leaving directory 
> `/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi/mca/coll/fca'
> 15:45:00 make[1]: *** [install-recursive] Error 1
> 15:45:00 make[1]: Leaving directory 
> `/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi'
> 15:45:00 make: *** [install-recursive] Error 1
> 15:45:00 Build step 'Execute shell' marked build as failure
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15557.php



[OMPI devel] jenkins error in trunk

2014-08-08 Thread Mike Dubman
Josh,Devendar - could you please take a look?

Thanks

15:45:00 Making install in mca/coll/fca
15:45:00 make[2]: Entering directory 
`/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi/mca/coll/fca'
15:45:00   CC   coll_fca_module.lo
15:45:00 coll_fca_module.c: In function 'have_remote_peers':
15:45:00 coll_fca_module.c:48: error: 'ompi_proc_t' has no member named 'proc_flags'
15:45:00 coll_fca_module.c:48: error: 'ompi_proc_t' has no member named 'proc_flags'
15:45:00 coll_fca_module.c: In function '__get_local_ranks':
15:45:00 coll_fca_module.c:75: error: 'ompi_proc_t' has no member named 'proc_flags'
15:45:00 coll_fca_module.c:75: error: 'ompi_proc_t' has no member named 'proc_flags'
15:45:00 coll_fca_module.c:95: error: 'ompi_proc_t' has no member named 'proc_flags'
15:45:00 coll_fca_module.c:95: error: 'ompi_proc_t' has no member named 'proc_flags'
15:45:00 make[2]: *** [coll_fca_module.lo] Error 1
15:45:00 make[2]: Leaving directory 
`/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi/mca/coll/fca'
15:45:00 make[1]: *** [install-recursive] Error 1
15:45:00 make[1]: Leaving directory 
`/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi'
15:45:00 make: *** [install-recursive] Error 1
15:45:00 Build step 'Execute shell' marked build as failure


Re: [OMPI devel] Open MPI SVN -> Git (github) conversion

2014-08-08 Thread Tim Mattox
Jeff,
I may someday again be working for an organization that is an Open MPI
contributor... so could you
update my e-mail address in the authors.txt file to be "timattox = Tim
Mattox "
Thanks!


On Fri, Aug 8, 2014 at 11:00 AM, Jeff Squyres (jsquyres)  wrote:

> SHORT VERSION
> =
>
> Please verify/update the email address that you'd like me to use for your
> Open MPI commits when we do the git conversion:
>
> https://github.com/open-mpi/authors
>
> Updates are due by COB Friday, 15 Aug, 2014 (1 week from today).
>
> MORE DETAIL
> ===
>
> Dave and I are continuing to work on the logistics of the SVN -> Git
> conversion.
>
> As part of the process, I need email addresses for which you'd like your
> commits to appear in the git repo.  Please see this git repo for the
> current list of email addresses that I have, as well as instructions for
> how to update them:
>
> https://github.com/open-mpi/authors
>
> Thanks!
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/1.php
>



-- 
Tim Mattox, Ph.D. - tmat...@gmail.com


[OMPI devel] Open MPI SVN -> Git (github) conversion

2014-08-08 Thread Jeff Squyres (jsquyres)
SHORT VERSION
=

Please verify/update the email address that you'd like me to use for your Open 
MPI commits when we do the git conversion:

https://github.com/open-mpi/authors

Updates are due by COB Friday, 15 Aug, 2014 (1 week from today).

MORE DETAIL
===

Dave and I are continuing to work on the logistics of the SVN -> Git conversion.

As part of the process, I need email addresses for which you'd like your 
commits to appear in the git repo.  Please see this git repo for the current 
list of email addresses that I have, as well as instructions for how to update 
them:

https://github.com/open-mpi/authors

Thanks!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] errors and warnings with show_help() usage

2014-08-08 Thread Jeff Squyres (jsquyres)
SHORT VERSION
=

The ./contrib/check-help-strings.pl script is showing ***47 coding errors*** 
with regard to using show_help() in components.  Here's a summary of the 
offenders:

- ORTE (lumped together because there's a single maintainer :-) )
- smcuda and cuda
- common/verbs
- bcol
- mxm
- openib
- oshmem

Could the owners of these portions of the code base please run 
./contrib/check-help-strings.pl and fix the ERRORs that are shown?

Thanks!

MORE DETAIL
===

The first part of ./contrib/check-help-strings.pl's output shows ERRORs -- 
referring to help files that do not exist, or referring to help topics that do 
not exist.

I'm only calling out the ERRORs in this mail -- but the second part of the 
output shows a bazillion WARNINGs, too.  These are help topics that are 
probably unused -- they don't seem to be referenced by the code anywhere.  

It would be good to clean up all the WARNINGs, too, but the ERRORs are more 
worrisome.
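
For context, the calls being checked look roughly like this (a hypothetical
component snippet -- the file name, topic, and arguments are made up):

    #include "opal/util/show_help.h"

    static void warn_fd_limit(int fd_limit)
    {
        /* check-help-strings.pl reports an ERROR if the named help file
         * does not exist, or if it has no [fd-too-small] topic section */
        opal_show_help("help-my-component.txt", "fd-too-small",
                       1 /* want_error_header */, fd_limit);
    }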

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC

2014-08-08 Thread Jeff Squyres (jsquyres)
Weirdness; I don't see any name like that in the SM BTL.

I see it used in the OMPI layer... not sure how it's being used down in the 
btl SM component file...?


On Aug 7, 2014, at 11:25 PM, Paul Hargrove  wrote:

> Testing r32448 on trunk for trac issue #4834, I encounter the following which 
> appears unrelated to #4834:
> 
>   CCLD orte-info
> Undefined   first referenced
>  symbol in file
> ompi_proc_local_proc
> /sandbox/hargrove/OMPI/openmpi-trunk-solaris10-sparcT2-ss12u3-v9-static/BLD/opal/.libs/libopen-pal.a(libmca_btl_sm_la-btl_sm_component.o)
> ld: fatal: Symbol referencing errors. No output written to orte-info
> 
> Note that this is *static* linking.
> 
> This appears to indicate a call from OPAL to OMPI, and I am guessing this is 
> a side-effect of the BTL move.
> 
> Since OMPI contains (many) calls to OPAL this is a circular library 
> dependence.
> Unfortunately, some linkers process their arguments strictly left-to-right.
> Thus if this dependence is not eliminated one may need "-lmpi -lopen-pal 
> -lmpi" (or similar) to resolve it.
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15540.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
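
A toy reconstruction of the link-ordering failure Paul describes, with
made-up file and symbol names (not OMPI code):

    /* helper.c and entry.c -> libupper.a (plays the role of libmpi);
     * lower.c -> liblower.a (plays the role of libopen-pal) */

    /* helper.c */
    void upper_helper(void) { }

    /* entry.c */
    void lower_entry(void);
    void upper_entry(void) { lower_entry(); }

    /* lower.c -- the bug: the lower library calls up into the upper one,
     * just as btl_sm_component.o references ompi_proc_local_proc above */
    void upper_helper(void);
    void lower_entry(void) { upper_helper(); }

    /* main.c */
    void upper_entry(void);
    int main(void) { upper_entry(); return 0; }

    /* A strictly left-to-right linker fails on
     *     cc main.c -lupper -llower
     * because upper_helper is requested only after libupper.a has been
     * scanned; repeating the library, "-lupper -llower -lupper", is the
     * analogue of the "-lmpi -lopen-pal -lmpi" workaround. */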



[OMPI devel] ibm abort test hangs on one node

2014-08-08 Thread Gilles Gouaillardet
Folks,

here is the description of a hang i briefly mentioned a few days ago.

with the trunk (i did not check 1.8 ...) simply run on one node :
mpirun -np 2 --mca btl sm,self ./abort

(the abort test is taken from the ibm test suite: process 0 calls
MPI_Abort while process 1 enters an infinite loop)

there is a race condition : sometimes it hangs, sometimes it aborts
nicely as expected.
when the hang occurs, both abort processes have exited and mpirun waits
forever

i made some investigations and i have now a better idea of what happens
(but i am still clueless on how to fix this)

when process 0 aborts, it:
- closes the tcp socket connected to mpirun
- closes the pipe connected to mpirun
- sends SIGCHLD to mpirun

then on mpirun :
when SIGCHLD is received, the handler basically writes 17 (the signal
number) to a socketpair.
then libevent will return from a poll and here is the race condition,
basically :
if revents is non zero for the three fds (socket, pipe and socketpair)
then the program will abort nicely
if revents is non zero for both socket and pipe but is zero for the
socketpair, then the mpirun will hang

i dug a bit deeper and found that when the event on the socketpair is
processed, it will end up calling
odls_base_default_wait_local_proc.
if proc->state is 5 (aka ORTE_PROC_STATE_REGISTERED), then the program
will abort nicely
*but* if proc->state is 6 (aka ORTE_PROC_STATE_IOF_COMPLETE), then the
program will hang

another way to put this is that
when the program aborts nicely, the call sequence is
odls_base_default_wait_local_proc
proc_errors(vpid=0)
proc_errors(vpid=0)
proc_errors(vpid=1)
proc_errors(vpid=1)

when the program hangs, the call sequence is
proc_errors(vpid=0)
odls_base_default_wait_local_proc
proc_errors(vpid=0)
proc_errors(vpid=1)
proc_errors(vpid=1)

i will resume this on Monday unless someone can fix this in the meantime :-)

Cheers,

Gilles
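
For reference, a minimal standalone sketch of the signal-forwarding pattern
described above (hypothetical code, not the actual ORTE/libevent sources):

    #include <signal.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int sigpair[2];   /* [1] written by the handler, [0] polled */

    static void sigchld_handler(int signum)
    {
        unsigned char b = (unsigned char) signum;   /* e.g. 17 == SIGCHLD */
        (void) write(sigpair[1], &b, 1);   /* write() is async-signal-safe */
    }

    static int setup_sigchld_forwarding(void)
    {
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sigpair) < 0) {
            return -1;
        }
        signal(SIGCHLD, sigchld_handler);
        /* sigpair[0] is then registered with the event loop alongside the
         * child's tcp socket and pipe -- the three fds whose relative
         * readiness creates the race described above */
        return 0;
    }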


Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Ralph Castain
Sorry to chime in a little late. George is likely correct about using 
ORTE_NAME, only you can't do that as the OPAL layer has no idea what that 
datatype looks like. This was the original reason for creating the 
opal_identifier_t type - I had no other choice when we moved the db framework 
(now dstore) to the OPAL layer in anticipation of the BTLs moving to OPAL. The 
abstraction requirement wouldn't allow me to pass down the structure definition.

The easiest solution is probably to change the opal/db/hash code so that 64-bit 
fields are memcpy'd instead of simply passed by "=". This should eliminate the 
problem with the least fuss.
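
A sketch of the idea (illustrative code, not the actual opal/db/hash
change): copy through a naturally aligned local instead of dereferencing a
possibly misaligned pointer as a 64-bit value.

    #include <stdint.h>
    #include <string.h>

    /* safe on strict-alignment CPUs such as SPARC */
    static uint64_t load_id64(const void *p)
    {
        uint64_t v;   /* a local variable is naturally 8-byte aligned */
        /* a direct "*(const uint64_t *)p" may SIGBUS on SPARC when p is
         * only 4-byte aligned; memcpy has no alignment requirement */
        memcpy(&v, p, sizeof(v));
        return v;
    }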

There is a performance penalty for using non-aligned data, and ideally we 
should use aligned data whenever possible. This code isn't in the critical path 
and so this is less of an issue, but it would still be nice to do. However, I 
didn't do so for the following reasons:

* I couldn't find a way for the compiler to check/require alignment down in 
opal_db.store when passed a parameter. If someone knows of a way to do that, 
please feel free to suggest it

* none of our current developers have access to a Solaris SPARC machine, and 
thus our developers cannot detect violations when they occur

* the current solution avoids the issue, albeit with a slight performance 
penalty

I'm open to alternative methods - I'm not happy with the ugliness this 
required, but couldn't come up with a cleaner solution that would make it easy 
for developers to know when they have violated the alignment requirement.

FWIW: it is possible, I suppose, that the other discussion about using an 
opal_process_name_t that exactly mirrors orte_process_name_t could also resolve 
this problem in a cleaner fashion. I didn't impose that requirement here, but 
maybe it's another motivator for doing so?

Ralph


On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet 
 wrote:

> George,
> 
> (one of the) faulty line was :
> 
>if (ORTE_SUCCESS != (rc = 
> opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL,
> OPAL_DB_LOCALLDR, 
> (opal_identifier_t*)&proc, OPAL_ID_T))) {
> 
> so if proc is not 64 bits aligned, a SIGBUS will occur on sparc.
> as you pointed out, replacing OPAL_ID_T with ORTE_NAME will very likely fix
> issue (i have no arch to test...)
> 
> i was initially also "confused" with the following line
> 
> if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, 
> OPAL_SCOPE_INTERNAL,
> ORTE_DB_NPROC_OFFSET, 
> &offset, OPAL_UINT32))) {
> 
> the first argument of store is an (opal_identifier_t *)
> strictly speaking this is "a pointer to a 64 bits aligned address", and proc 
> might not be 64 bits aligned.
> /* that being said, there is no crash :-) */
> 
> in this case, opal_db.store pointer points to the store function 
> (db_hash.c:178)
> and proc is only used in a memcpy at line 194, so 64 bits alignment is not 
> required.
> (and comment is explicit : /* to protect alignment, copy the data across */
> 
> that might sound pedantic, but are we doing the right thing here?
> (e.g. cast to (opal_identifier_t *), followed by a memcpy  in case the 
> pointer was not 64 bits aligned
> vs always use aligned data ?)
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/08 14:58, George Bosilca wrote:
>> This is a gigantic patch for an almost trivial issue. The current problem
>> is purely related to the fact that in a single location (nidmap.c) the
>> orte_process_name_t (which is a structure of 2 integers) is supposed to be
>> aligned based on the uint64_t requirements. Bad assumption!
>> 
>> Looking at the code one might notice that the orte_process_name_t is stored
>> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold
>> on the SPARC architecture because the two types (int32_t and int64_t) have
>> different alignments.  However, ORTE defines a type for orte_process_name_t.
>> Thus, I think that if instead of saving the orte_process_name_t as an
>> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
>> 
>>   George.
>> 
>> 
>> 
>> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>> 
>>> Kawashima-san and all,
>>> 
>>> Here is attached a one off patch for v1.8.
>>> /* it does not use the __attribute__ modifier that might not be
>>> supported by all compilers */
>>> 
>>> as far as i am concerned, the same issue is also in the trunk,
>>> and if you do not hit it, it just means you are lucky :-)
>>> 
>>> the same issue might also be in other parts of the code :-(
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
 Gilles, George,
 
 The problem is the one Gilles pointed out.
 I temporarily modified the code below and the bus error disappeared.
 
 --- orte/util/nidmap.c  (revision 32447)
 +++ orte/util/nidmap.c  (working copy)

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Gilles Gouaillardet
George,

(one of the) faulty line was :

   if (ORTE_SUCCESS != (rc =
opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL,

OPAL_DB_LOCALLDR, (opal_identifier_t*)&proc, OPAL_ID_T))) {

so if proc is not 64 bits aligned, a SIGBUS will occur on sparc.
as you pointed out, replacing OPAL_ID_T with ORTE_NAME will very likely fix
the issue (i have no arch to test...)

i was initially also "confused" with the following line

if (ORTE_SUCCESS != (rc =
opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_INTERNAL,
ORTE_DB_NPROC_OFFSET,
&offset, OPAL_UINT32))) {

the first argument of store is an (opal_identifier_t *)
strictly speaking this is "a pointer to a 64 bits aligned address", and
proc might not be 64 bits aligned.
/* that being said, there is no crash :-) */

in this case, opal_db.store pointer points to the store function
(db_hash.c:178)
and proc is only used in a memcpy at line 194, so 64 bits alignment is not
required (and the comment is explicit: /* to protect alignment, copy the
data across */).

that might sound pedantic, but are we doing the right thing here?
(e.g. cast to (opal_identifier_t *), followed by a memcpy  in case the
pointer was not 64 bits aligned
vs always use aligned data ?)
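
A sketch of the "always use aligned data" option, based on the union idea
quoted further down in this message (illustrative declaration only):

    #include <stdint.h>

    typedef union {
        uint64_t id64;      /* forces 8-byte alignment of every instance */
        struct {
            uint32_t jobid;
            uint32_t vpid;
        } name;
    } aligned_name_t;

    /* a pointer to this union can safely be cast to (uint64_t *) on
     * strict-alignment CPUs, unlike a bare struct of two uint32_t */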

Cheers,

Gilles

On 2014/08/08 14:58, George Bosilca wrote:
> This is a gigantic patch for an almost trivial issue. The current problem
> is purely related to the fact that in a single location (nidmap.c) the
> orte_process_name_t (which is a structure of 2 integers) is supposed to be
> aligned based on the uint64_t requirements. Bad assumption!
>
> Looking at the code one might notice that the orte_process_name_t is stored
> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold
> on the SPARC architecture because the two types (int32_t and int64_t) have
> different alignments.  However, ORTE defines a type for orte_process_name_t.
> Thus, I think that if instead of saving the orte_process_name_t as an
> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
>
>   George.
>
>
>
> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>> Kawashima-san and all,
>>
>> Here is attached a one off patch for v1.8.
>> /* it does not use the __attribute__ modifier that might not be
>> supported by all compilers */
>>
>> as far as i am concerned, the same issue is also in the trunk,
>> and if you do not hit it, it just means you are lucky :-)
>>
>> the same issue might also be in other parts of the code :-(
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
>>> Gilles, George,
>>>
>>> The problem is the one Gilles pointed out.
>>> I temporarily modified the code below and the bus error disappeared.
>>>
>>> --- orte/util/nidmap.c  (revision 32447)
>>> +++ orte/util/nidmap.c  (working copy)
>>> @@ -885,7 +885,7 @@
>>>  orte_proc_state_t state;
>>>  orte_app_idx_t app_idx;
>>>  int32_t restarts;
>>> -orte_process_name_t proc, dmn;
>>> +orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
>>>  char *hostname;
>>>  uint8_t flag;
>>>  opal_buffer_t *bptr;
>>>
>>> Takahiro Kawashima,
>>> MPI development team,
>>> Fujitsu
>>>
 Kawashima-san,

 This is interesting :-)

 proc is in the stack and has type orte_process_name_t

 with

 typedef uint32_t orte_jobid_t;
 typedef uint32_t orte_vpid_t;
 struct orte_process_name_t {
 orte_jobid_t jobid; /**< Job number */
 orte_vpid_t vpid;   /**< Process id - equivalent to rank */
 };
 typedef struct orte_process_name_t orte_process_name_t;


 so there is really no reason to align this on 8 bytes...
 but later, proc is cast to a uint64_t ...
 so proc should have been aligned on 8 bytes but it is too late,
 and hence the glorious SIGBUS


 this is loosely related to
 http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
 (see heterogeneous.v2.patch)
 if we make opal_process_name_t a union of uint64_t and a struct of two
 uint32_t, the compiler
 will align this on 8 bytes.
 note the patch is not enough (and will not apply on the v1.8 branch
>> anyway),
 we could simply remove orte_process_name_t and ompi_process_name_t and
 use only
 opal_process_name_t (and never declare variables with type
 opal_proc_name_t otherwise alignment might be incorrect)

 as a workaround, you can declare an opal_process_name_t (for alignment),
 and cast it to an orte_process_name_t

 i will write a patch (i will not be able to test on sparc ...)
 please note this issue might be present in other places

 Cheers,

 Gilles

 On 2014/08/08 13:03, Kawashima, Takahiro wrote:
> Hi,
>
>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
>> 10 Sparc and I receive a bus error, if I run a small program.

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread George Bosilca
This is a gigantic patch for an almost trivial issue. The current problem
is purely related to the fact that in a single location (nidmap.c) the
orte_process_name_t (which is a structure of 2 integers) is supposed to be
aligned based on the uint64_t requirements. Bad assumption!

Looking at the code one might notice that the orte_process_name_t is stored
using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold
on the SPARC architecture because the two types (int32_t and int64_t) have
different alignments.  However, ORTE defines a type for orte_process_name_t.
Thus, I think that if instead of saving the orte_process_name_t as an
OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
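
In diff form the idea would be something like this (an untested sketch;
it assumes the db framework can handle the ORTE_NAME type directly):

--- orte/util/nidmap.c
+++ orte/util/nidmap.c
@@
-    if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME,
-                              OPAL_SCOPE_INTERNAL, OPAL_DB_LOCALLDR,
-                              (opal_identifier_t*)&proc, OPAL_ID_T))) {
+    if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME,
+                              OPAL_SCOPE_INTERNAL, OPAL_DB_LOCALLDR,
+                              &proc, ORTE_NAME))) {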

  George.



On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

> Kawashima-san and all,
>
> Here is attached a one-off patch for v1.8.
> /* it does not use the __attribute__ modifier that might not be
> supported by all compilers */
>
> as far as i am concerned, the same issue is also in the trunk,
> and if you do not hit it, it just means you are lucky :-)
>
> the same issue might also be in other parts of the code :-(
>
> Cheers,
>
> Gilles
>
> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
> > Gilles, George,
> >
> > The problem is the one Gilles pointed out.
> > I temporarily modified the code below and the bus error disappeared.
> >
> > --- orte/util/nidmap.c  (revision 32447)
> > +++ orte/util/nidmap.c  (working copy)
> > @@ -885,7 +885,7 @@
> >  orte_proc_state_t state;
> >  orte_app_idx_t app_idx;
> >  int32_t restarts;
> > -orte_process_name_t proc, dmn;
> > +orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
> >  char *hostname;
> >  uint8_t flag;
> >  opal_buffer_t *bptr;
> >
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> >
> >> Kawashima-san,
> >>
> >> This is interesting :-)
> >>
> >> proc is in the stack and has type orte_process_name_t
> >>
> >> with
> >>
> >> typedef uint32_t orte_jobid_t;
> >> typedef uint32_t orte_vpid_t;
> >> struct orte_process_name_t {
> >> orte_jobid_t jobid; /**< Job number */
> >> orte_vpid_t vpid;   /**< Process id - equivalent to rank */
> >> };
> >> typedef struct orte_process_name_t orte_process_name_t;
> >>
> >>
> >> so there is really no reason to align this on 8 bytes...
> >> but later, proc is cast to a uint64_t ...
> >> so proc should have been aligned on 8 bytes but it is too late,
> >> and hence the glorious SIGBUS
> >>
> >>
> >> this is loosely related to
> >> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
> >> (see heterogeneous.v2.patch)
> >> if we make opal_process_name_t a union of uint64_t and a struct of two
> >> uint32_t, the compiler
> >> will align this on 8 bytes.
> >> note the patch is not enough (and will not apply on the v1.8 branch
> anyway),
> >> we could simply remove orte_process_name_t and ompi_process_name_t and
> >> use only
> >> opal_process_name_t (and never declare variables with type
> >> opal_proc_name_t otherwise alignment might be incorrect)
> >>
> >> as a workaround, you can declare an opal_process_name_t (for alignment),
> >> and cast it to an orte_process_name_t
> >>
> >> i will write a patch (i will not be able to test on sparc ...)
> >> please note this issue might be present in other places
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On 2014/08/08 13:03, Kawashima, Takahiro wrote:
> >>> Hi,
> >>>
>  I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
>  10 Sparc and I receive a bus error, if I run a small program.
> >>> I've finally reproduced the bus error in my SPARC environment.
> >>>
> >>> #0 0x00db4740 (__waitpid_nocancel + 0x44)
> (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4)
> >>> #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct
> siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in
> ../sigattach.c 
> >>> #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *)
> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8
> "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line
> 252 in db_hash.c
> >>> #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long
> *) 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8
> "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line
> 49 in db_base_fns.c
> >>> #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *)
> 0x00281d70) at line 975 in nidmap.c
> >>> #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct
> opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c
> >>> #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in
> ess_env_module.c
> >>> #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *)
> 0x,pargv=(char ***) 0x,flags=32) at line
> 148 in orte_init.c
> >>> #8 0x001a6f08 (ompi_mpi_ini

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Gilles,

I applied your patch to v1.8 and it ran successfully
on my SPARC machines.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Kawashima-san and all,
> 
> Here is attached a one-off patch for v1.8.
> /* it does not use the __attribute__ modifier that might not be
> supported by all compilers */
> 
> as far as i am concerned, the same issue is also in the trunk,
> and if you do not hit it, it just means you are lucky :-)
> 
> the same issue might also be in other parts of the code :-(
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
> > Gilles, George,
> >
> > The problem is the one Gilles pointed.
> > I temporarily modified the code bellow and the bus error disappeared.
> >
> > --- orte/util/nidmap.c  (revision 32447)
> > +++ orte/util/nidmap.c  (working copy)
> > @@ -885,7 +885,7 @@
> >  orte_proc_state_t state;
> >  orte_app_idx_t app_idx;
> >  int32_t restarts;
> > -orte_process_name_t proc, dmn;
> > +orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
> >  char *hostname;
> >  uint8_t flag;
> >  opal_buffer_t *bptr;
> >
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> >
> >> Kawashima-san,
> >>
> >> This is interesting :-)
> >>
> >> proc is in the stack and has type orte_process_name_t
> >>
> >> with
> >>
> >> typedef uint32_t orte_jobid_t;
> >> typedef uint32_t orte_vpid_t;
> >> struct orte_process_name_t {
> >> orte_jobid_t jobid; /**< Job number */
> >> orte_vpid_t vpid;   /**< Process id - equivalent to rank */
> >> };
> >> typedef struct orte_process_name_t orte_process_name_t;
> >>
> >>
> >> so there is really no reason to align this on 8 bytes...
> >> but later, proc is cast to a uint64_t ...
> >> so proc should have been aligned on 8 bytes but it is too late,
> >> and hence the glorious SIGBUS
> >>
> >>
> >> this is loosely related to
> >> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
> >> (see heterogeneous.v2.patch)
> >> if we make opal_process_name_t a union of uint64_t and a struct of two
> >> uint32_t, the compiler
> >> will align this on 8 bytes.
> >> note the patch is not enough (and will not apply on the v1.8 branch 
> >> anyway),
> >> we could simply remove orte_process_name_t and ompi_process_name_t and
> >> use only
> >> opal_process_name_t (and never declare variables with type
> >> opal_proc_name_t otherwise alignment might be incorrect)
> >>
> >> as a workaround, you can declare an opal_process_name_t (for alignment),
> >> and cast it to an orte_process_name_t
> >>
> >> i will write a patch (i will not be able to test on sparc ...)
> >> please note this issue might be present in other places
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On 2014/08/08 13:03, Kawashima, Takahiro wrote:
> >>> Hi,
> >>>
>  I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
>  10 Sparc and I receive a bus error, if I run a small program.
> >>> I've finally reproduced the bus error in my SPARC environment.
> >>>
> >>> #0 0x00db4740 (__waitpid_nocancel + 0x44) 
> >>> (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4)
> >>> #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct 
> >>> siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 
> >>> in ../sigattach.c 
> >>> #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) 
> >>> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
> >>> "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line 
> >>> 252 in db_hash.c
> >>> #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) 
> >>> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
> >>> "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at 
> >>> line 49 in db_base_fns.c
> >>> #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) 
> >>> 0x00281d70) at line 975 in nidmap.c
> >>> #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct 
> >>> opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c
> >>> #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in 
> >>> ess_env_module.c
> >>> #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) 
> >>> 0x,pargv=(char ***) 0x,flags=32) at line 
> >>> 148 in orte_init.c
> >>> #8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) 
> >>> 0x07fef348,requested=0,provided=(int *) 0x07fee698) at 
> >>> line 464 in ompi_mpi_init.c
> >>> #9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) 
> >>> 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in 
> >>> init.c
> >>> #10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) 
> >>> 0x07fef348) at line 8 in mpiinitfinalize.c
> >>> #11 0x00d2b81c (__libc_start_main + 0x194) 
> >>> (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0)
> >>> #12 0x0010094c (_start + 0x2c) ()
> >>>
> >>> The line 2

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Gilles Gouaillardet
Kawashima-san and all,

Attached is a one-off patch for v1.8.
/* it does not use the __attribute__ modifier that might not be
supported by all compilers */

As far as I am concerned, the same issue is also in the trunk,
and if you do not hit it, it just means you are lucky :-)

The same issue might also be in other parts of the code :-(

Cheers,

Gilles

On 2014/08/08 13:45, Kawashima, Takahiro wrote:
> Gilles, George,
>
> The problem is the one Gilles pointed out.
> I temporarily modified the code below and the bus error disappeared.
>
> --- orte/util/nidmap.c  (revision 32447)
> +++ orte/util/nidmap.c  (working copy)
> @@ -885,7 +885,7 @@
>  orte_proc_state_t state;
>  orte_app_idx_t app_idx;
>  int32_t restarts;
> -orte_process_name_t proc, dmn;
> +orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
>  char *hostname;
>  uint8_t flag;
>  opal_buffer_t *bptr;
>
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
>
>> Kawashima-san,
>>
>> This is interesting :-)
>>
>> proc is in the stack and has type orte_process_name_t
>>
>> with
>>
>> typedef uint32_t orte_jobid_t;
>> typedef uint32_t orte_vpid_t;
>> struct orte_process_name_t {
>> orte_jobid_t jobid; /**< Job number */
>> orte_vpid_t vpid;   /**< Process id - equivalent to rank */
>> };
>> typedef struct orte_process_name_t orte_process_name_t;
>>
>>
>> so there is really no reason to align this on 8 bytes...
>> but later, proc is cast to a uint64_t ...
>> so proc should have been aligned on 8 bytes but it is too late,
>> and hence the glorious SIGBUS
>>
>>
>> this is loosely related to
>> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
>> (see heterogeneous.v2.patch)
>> if we make opal_process_name_t a union of uint64_t and a struct of two
>> uint32_t, the compiler
>> will align this on 8 bytes.
>> note the patch is not enough (and will not apply on the v1.8 branch anyway),
>> we could simply remove orte_process_name_t and ompi_process_name_t and
>> use only
>> opal_process_name_t (and never declare variables with type
>> opal_proc_name_t otherwise alignment might be incorrect)
>>
>> as a workaround, you can declare an opal_process_name_t (for alignment),
>> and cast it to an orte_process_name_t
>>
>> i will write a patch (i will not be able to test on sparc ...)
>> please note this issue might be present in other places
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/08/08 13:03, Kawashima, Takahiro wrote:
>>> Hi,
>>>
 I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
 10 Sparc and I receive a bus error, if I run a small program.
>>> I've finally reproduced the bus error in my SPARC environment.
>>>
>>> #0 0x00db4740 (__waitpid_nocancel + 0x44) 
>>> (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4)
>>> #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct 
>>> siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in 
>>> ../sigattach.c 
>>> #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) 
>>> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
>>> "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line 
>>> 252 in db_hash.c
>>> #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) 
>>> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
>>> "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line 
>>> 49 in db_base_fns.c
>>> #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) 
>>> 0x00281d70) at line 975 in nidmap.c
>>> #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct 
>>> opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c
>>> #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c
>>> #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) 
>>> 0x,pargv=(char ***) 0x,flags=32) at line 
>>> 148 in orte_init.c
>>> #8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) 
>>> 0x07fef348,requested=0,provided=(int *) 0x07fee698) at line 
>>> 464 in ompi_mpi_init.c
>>> #9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) 
>>> 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c
>>> #10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) 
>>> 0x07fef348) at line 8 in mpiinitfinalize.c
>>> #11 0x00d2b81c (__libc_start_main + 0x194) 
>>> (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0)
>>> #12 0x0010094c (_start + 0x2c) ()
>>>
>>> The line 252 in opal/mca/db/hash/db_hash.c is:
>>>
>>> case OPAL_UINT64:
>>> if (NULL == data) {
>>> OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM);
>>> return OPAL_ERR_BAD_PARAM;
>>> }
>>> kv->type = OPAL_UINT64;
>>> kv->data.uint64 = *(uint64_t*)(data); // !!! here !!!
>>> break;
>>>
>>> My environment is:
>>>
> >>   Open MPI v1.8 branch r32447 (latest)

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Gilles, George,

The problem is the one Gilles pointed out.
I temporarily modified the code below and the bus error disappeared.

--- orte/util/nidmap.c  (revision 32447)
+++ orte/util/nidmap.c  (working copy)
@@ -885,7 +885,7 @@
 orte_proc_state_t state;
 orte_app_idx_t app_idx;
 int32_t restarts;
-orte_process_name_t proc, dmn;
+orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
 char *hostname;
 uint8_t flag;
 opal_buffer_t *bptr;

Takahiro Kawashima,
MPI development team,
Fujitsu

> Kawashima-san,
> 
> This is interesting :-)
> 
> proc is in the stack and has type orte_process_name_t
> 
> with
> 
> typedef uint32_t orte_jobid_t;
> typedef uint32_t orte_vpid_t;
> struct orte_process_name_t {
> orte_jobid_t jobid; /**< Job number */
> orte_vpid_t vpid;   /**< Process id - equivalent to rank */
> };
> typedef struct orte_process_name_t orte_process_name_t;
> 
> 
> so there is really no reason to align this on 8 bytes...
> but later, proc is cast to a uint64_t ...
> so proc should have been aligned on 8 bytes but it is too late,
> and hence the glorious SIGBUS
> 
> 
> this is loosely related to
> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
> (see heterogeneous.v2.patch)
> if we make opal_process_name_t a union of uint64_t and a struct of two
> uint32_t, the compiler
> will align this on 8 bytes.
> note the patch is not enough (and will not apply on the v1.8 branch anyway),
> we could simply remove orte_process_name_t and ompi_process_name_t and
> use only
> opal_process_name_t (and never declare variables with type
> opal_proc_name_t otherwise alignment might be incorrect)
> 
> as a workaround, you can declare an opal_process_name_t (for alignment),
> and cast it to an orte_process_name_t
> 
> i will write a patch (i will not be able to test on sparc ...)
> please note this issue might be present in other places
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/08 13:03, Kawashima, Takahiro wrote:
> > Hi,
> >
> >> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
> >> 10 Sparc and I receive a bus error, if I run a small program.
> > I've finally reproduced the bus error in my SPARC environment.
> >
> > #0 0x00db4740 (__waitpid_nocancel + 0x44) 
> > (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4)
> > #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct 
> > siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in 
> > ../sigattach.c 
> > #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) 
> > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
> > "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line 
> > 252 in db_hash.c
> > #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) 
> > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
> > "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line 
> > 49 in db_base_fns.c
> > #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) 
> > 0x00281d70) at line 975 in nidmap.c
> > #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct 
> > opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c
> > #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c
> > #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) 
> > 0x,pargv=(char ***) 0x,flags=32) at line 
> > 148 in orte_init.c
> > #8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) 
> > 0x07fef348,requested=0,provided=(int *) 0x07fee698) at line 
> > 464 in ompi_mpi_init.c
> > #9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) 
> > 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c
> > #10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) 
> > 0x07fef348) at line 8 in mpiinitfinalize.c
> > #11 0x00d2b81c (__libc_start_main + 0x194) 
> > (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0)
> > #12 0x0010094c (_start + 0x2c) ()
> >
> > The line 252 in opal/mca/db/hash/db_hash.c is:
> >
> > case OPAL_UINT64:
> > if (NULL == data) {
> > OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM);
> > return OPAL_ERR_BAD_PARAM;
> > }
> > kv->type = OPAL_UINT64;
> > kv->data.uint64 = *(uint64_t*)(data); // !!! here !!!
> > break;
> >
> > My environment is:
> >
> >   Open MPI v1.8 branch r32447 (latest)
> >   configure --enable-debug
> >   SPARC-V9 (Fujitsu SPARC64 IXfx)
> >   Linux (custom)
> >   gcc 4.2.4
> >
> > I could not reproduce it with Open MPI trunk nor with Fujitsu compiler.
> >
> > Can this information help?
> >
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> >
> >> Hi,
> >>
> >> I'm sorry once more to answer late, but the last two days our mail
> >> server was down (hardware error).
> >>
> >>> Did you configure this --enable-debug?
> >> Yes, I 

Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value

2014-08-08 Thread George Bosilca
Paul's tests identified a small issue with the previous patch (a real
corner-case for ARM v5). The patch below fixes all known issues.

Btw, there is still room for volunteers for the .asm work.
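
For reference, the return-the-old-value semantics look like this when
the GCC __sync builtins are available (a minimal sketch of my own; the
real patch also updates the per-architecture inline-asm implementations):

#include <stdint.h>

/* hypothetical name for illustration: a compare-and-swap that returns
 * the value *addr held before the operation, so the caller can both
 * detect failure (ret != oldval) and reuse the old value */
static inline int32_t
atomic_cmpset_old_32(volatile int32_t *addr, int32_t oldval, int32_t newval)
{
    return __sync_val_compare_and_swap(addr, oldval, newval);
}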

  George.



On Tue, Aug 5, 2014 at 2:23 PM, George Bosilca  wrote:

> Thanks to Paul help all the inlined atomics have been tested. The new
> patch is attached below. However, this only fixes the inline atomics, all
> those generated from the *.asm files have not been updated. Any volunteer?
>
>   George.
>
>
>
> On Aug 1, 2014, at 18:09 , Paul Hargrove  wrote:
>
> I have confirmed that George's latest version works on both SPARC ABIs.
>
> ARMv7 and three MIPS ABIs still pending...
>
> -Paul
>
>
> On Fri, Aug 1, 2014 at 9:40 AM, George Bosilca 
> wrote:
>
>> Another version of the atomic patch. Paul has tested it on a bunch of
>> platforms. At this point we have confirmation from all architectures except
>> SPARC (v8+ and v9).
>>
>>   George.
>>
>>
>>
>> On Jul 31, 2014, at 19:13 , George Bosilca  wrote:
>>
>> > All,
>> >
>> > Here is the patch that changes the meaning of the atomics to make them
>> always return the previous value (similar to sync_fetch_and_<*>). I tested
>> this with the following atomics: OS X, gcc style intrinsics and AMD64.
>> >
>> > I did not change the base assembly files used when GCC style assembly
>> operations are not supported. If someone feels like fixing them, feel free.
>> >
>> > Paul, I know you have a pretty diverse range of computers. Can you try to
>> compile and run a “make check” with the following patch?
>> >
>> >  George.
>> >
>> > 
>> >
>> > On Jul 30, 2014, at 15:21 , Nathan Hjelm  wrote:
>> >
>> >>
>> >> That is what I would prefer. I was trying to not disturb things too
>> >> much :). Please bring the changes over!
>> >>
>> >> -Nathan
>> >>
>> >> On Wed, Jul 30, 2014 at 03:18:44PM -0400, George Bosilca wrote:
>> >>>  Why do you want to add new versions? This will lead to having two,
>> almost
>> >>>  identical, sets of atomics that are conceptually equivalent but
>> different
>> >>>  in terms of code. And we will have to maintain both!
>> >>>  I did a similar change in a fork of OPAL in another project but
>> instead of
>> >>>  adding another flavor of atomics, I completely replaced the
>> available ones
>> >>>  with a set returning the old value. I can bring the code over.
>> >>>George.
>> >>>
>> >>>  On Tue, Jul 29, 2014 at 5:29 PM, Paul Hargrove 
>> wrote:
>> >>>
>> >>>On Tue, Jul 29, 2014 at 2:10 PM, Nathan Hjelm 
>> wrote:
>> >>>
>> >>>  Is there a reason why the
>> >>>  current implementations of opal atomics (add, cmpset) do not
>> return
>> >>>  the
>> >>>  old value?
>> >>>
>> >>>Because some CPUs don't implement such an atomic instruction?
>> >>>
>> >>>On any CPU one *can* certainly synthesize the desired operation
>> with an
>> >>>added read before the compare-and-swap to return a value that was
>> >>>present at some time before a failed cmpset.  That is almost
>> certainly
>> >>>sufficient for your purposes.  However, the added load makes it
>> >>>(marginally) more expensive on some CPUs that only have the native
>> >>>equivalent of gcc's __sync_bool_compare_and_swap().
>>
>


atomics.patch
Description: Binary data


Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Hi George,

> Takahiro you can confirm this by printing the value of data when signal is
> raised.

It's in the trace:
data = 0x07fede74, which is not 8-byte aligned (0x...74 mod 8 = 4).

#2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) 
0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
"opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line 252 
in db_hash.c

I want to dig into this issue, but unfortunately I have no time today.
My SPARC machines go down for maintenance in an hour...

Takahiro Kawashima,
MPI development team,
Fujitsu

> I have an extremely vague recollection about a similar issue in the
> datatype engine: on the SPARC architecture, 64-bit integers must be
> aligned on a 64-bit boundary or you get a bus error.
> 
> Takahiro you can confirm this by printing the value of data when signal is
> raised.
> 
> George.
> 
> 
> 
> On Fri, Aug 8, 2014 at 12:03 AM, Kawashima, Takahiro <
> t-kawash...@jp.fujitsu.com> wrote:
> 
> > Hi,
> >
> > > > >>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
> > > > >>> 10 Sparc and I receive a bus error, if I run a small program.
> >
> > I've finally reproduced the bus error in my SPARC environment.
> >
> > #0 0x00db4740 (__waitpid_nocancel + 0x44)
> > (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4)
> > #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct
> > siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in
> > ../sigattach.c 
> > #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *)
> > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8
> > "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line
> > 252 in db_hash.c
> > #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *)
> > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8
> > "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line
> > 49 in db_base_fns.c
> > #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *)
> > 0x00281d70) at line 975 in nidmap.c
> > #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct
> > opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c
> > #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c
> > #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *)
> > 0x,pargv=(char ***) 0x,flags=32) at line
> > 148 in orte_init.c
> > #8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **)
> > 0x07fef348,requested=0,provided=(int *) 0x07fee698) at line
> > 464 in ompi_mpi_init.c
> > #9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *)
> > 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c
> > #10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **)
> > 0x07fef348) at line 8 in mpiinitfinalize.c
> > #11 0x00d2b81c (__libc_start_main + 0x194)
> > (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0)
> > #12 0x0010094c (_start + 0x2c) ()
> >
> > The line 252 in opal/mca/db/hash/db_hash.c is:
> >
> > case OPAL_UINT64:
> > if (NULL == data) {
> > OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM);
> > return OPAL_ERR_BAD_PARAM;
> > }
> > kv->type = OPAL_UINT64;
> > kv->data.uint64 = *(uint64_t*)(data); // !!! here !!!
> > break;
> >
> > My environment is:
> >
> >   Open MPI v1.8 branch r32447 (latest)
> >   configure --enable-debug
> >   SPARC-V9 (Fujitsu SPARC64 IXfx)
> >   Linux (custom)
> >   gcc 4.2.4
> >
> > I could not reproduce it with Open MPI trunk nor with Fujitsu compiler.
> >
> > Can this information help?
> >
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> >
> > > Hi,
> > >
> > > I'm sorry once more to answer late, but the last two days our mail
> > > server was down (hardware error).
> > >
> > > > Did you configure this --enable-debug?
> > >
> > > Yes, I used the following command.
> > >
> > > ../openmpi-1.8.2rc3/configure --prefix=/usr/local/openmpi-1.8.2_64_gcc \
> > >   --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \
> > >   --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
> > >   --with-jdk-headers=/usr/local/jdk1.8.0/include \
> > >   JAVA_HOME=/usr/local/jdk1.8.0 \
> > >   LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \
> > >   CC="gcc" CXX="g++" FC="gfortran" \
> > >   CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \
> > >   CPP="cpp" CXXCPP="cpp" \
> > >   CPPFLAGS="" CXXCPPFLAGS="" \
> > >   --enable-mpi-cxx \
> > >   --enable-cxx-exceptions \
> > >   --enable-mpi-java \
> > >   --enable-heterogeneous \
> > >   --enable-mpi-thread-multiple \
> > >   --with-threads=posix \
> > >   --with-hwloc=internal \
> > >   --without-verbs \
> > >   --with-wrapper-cflags="-std=c11 -m64" \
> > >   --enable-debug \
> > >   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc
> > >
> > >
> > >
> > > > If so, you should get a line number in the backtrace
> > >
> > > I got them for gdb (see below)

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Gilles Gouaillardet
Kawashima-san,

This is interesting :-)

proc is on the stack and has type orte_process_name_t

with

typedef uint32_t orte_jobid_t;
typedef uint32_t orte_vpid_t;
struct orte_process_name_t {
orte_jobid_t jobid; /**< Job number */
orte_vpid_t vpid;   /**< Process id - equivalent to rank */
};
typedef struct orte_process_name_t orte_process_name_t;


so there is really no reason to align this on 8 bytes...
but later, proc is cast to a uint64_t ...
so proc should have been aligned on 8 bytes but it is too late,
and hence the glorious SIGBUS


this is loosely related to
http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
(see heterogeneous.v2.patch)
if we make opal_process_name_t a union of uint64_t and a struct of two
uint32_t, the compiler
will align this on 8 bytes.
note the patch is not enough (and will not apply on the v1.8 branch anyway),
we could simply remove orte_process_name_t and ompi_process_name_t and
use only
opal_process_name_t (and never declare variables with type
opal_proc_name_t otherwise alignment might be incorrect)
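
A minimal sketch of that union (my illustration only; the actual
heterogeneous.v2.patch may differ):

#include <stdint.h>

typedef union {
    uint64_t opaque;          /* forces 8-byte alignment of the union */
    struct {
        uint32_t jobid;
        uint32_t vpid;
    } name;
} opal_process_name_t;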

As a workaround, you can declare an opal_process_name_t (for alignment)
and cast it to an orte_process_name_t.

I will write a patch (I will not be able to test on SPARC ...).
Please note this issue might be present in other places.

Cheers,

Gilles

On 2014/08/08 13:03, Kawashima, Takahiro wrote:
> Hi,
>
>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
>> 10 Sparc and I receive a bus error, if I run a small program.
> I've finally reproduced the bus error in my SPARC environment.
>
> #0 0x00db4740 (__waitpid_nocancel + 0x44) 
> (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4)
> #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct siginfo 
> *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in 
> ../sigattach.c 
> #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) 
> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
> "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line 252 
> in db_hash.c
> #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) 
> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
> "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line 
> 49 in db_base_fns.c
> #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) 
> 0x00281d70) at line 975 in nidmap.c
> #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct 
> opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c
> #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c
> #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) 
> 0x,pargv=(char ***) 0x,flags=32) at line 148 
> in orte_init.c
> #8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) 
> 0x07fef348,requested=0,provided=(int *) 0x07fee698) at line 
> 464 in ompi_mpi_init.c
> #9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) 
> 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c
> #10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) 
> 0x07fef348) at line 8 in mpiinitfinalize.c
> #11 0x00d2b81c (__libc_start_main + 0x194) 
> (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0)
> #12 0x0010094c (_start + 0x2c) ()
>
> The line 252 in opal/mca/db/hash/db_hash.c is:
>
> case OPAL_UINT64:
> if (NULL == data) {
> OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM);
> return OPAL_ERR_BAD_PARAM;
> }
> kv->type = OPAL_UINT64;
> kv->data.uint64 = *(uint64_t*)(data); // !!! here !!!
> break;
>
> My environment is:
>
>   Open MPI v1.8 branch r32447 (latest)
>   configure --enable-debug
>   SPARC-V9 (Fujitsu SPARC64 IXfx)
>   Linux (custom)
>   gcc 4.2.4
>
> I could not reproduce it with Open MPI trunk nor with Fujitsu compiler.
>
> Can this information help?
>
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
>
>> Hi,
>>
>> I'm sorry once more to answer late, but the last two days our mail
>> server was down (hardware error).
>>
>>> Did you configure this --enable-debug?
>> Yes, I used the following command.
>>
>> ../openmpi-1.8.2rc3/configure --prefix=/usr/local/openmpi-1.8.2_64_gcc \
>>   --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \
>>   --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
>>   --with-jdk-headers=/usr/local/jdk1.8.0/include \
>>   JAVA_HOME=/usr/local/jdk1.8.0 \
>>   LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \
>>   CC="gcc" CXX="g++" FC="gfortran" \
>>   CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \
>>   CPP="cpp" CXXCPP="cpp" \
>>   CPPFLAGS="" CXXCPPFLAGS="" \
>>   --enable-mpi-cxx \
>>   --enable-cxx-exceptions \
>>   --enable-mpi-java \
>>   --enable-heterogeneous \
>>   --enable-mpi-thread-multiple \
>>   --with-threads=posix \
>>   --with-hwloc=internal \
>>   --without-verbs \
>>   --with-wrapper-cflags="-

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread George Bosilca
I have an extremely vague recollection about a similar issue in the
datatype engine: on the SPARC architecture, 64-bit integers must be
aligned on a 64-bit boundary or you get a bus error.

Takahiro you can confirm this by printing the value of data when signal is
raised.

George.



On Fri, Aug 8, 2014 at 12:03 AM, Kawashima, Takahiro <
t-kawash...@jp.fujitsu.com> wrote:

> Hi,
>
> > > >>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
> > > >>> 10 Sparc and I receive a bus error, if I run a small program.
>
> I've finally reproduced the bus error in my SPARC environment.
>
> #0 0x00db4740 (__waitpid_nocancel + 0x44)
> (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4)
> #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct
> siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in
> ../sigattach.c 
> #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *)
> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8
> "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line
> 252 in db_hash.c
> #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *)
> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8
> "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line
> 49 in db_base_fns.c
> #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *)
> 0x00281d70) at line 975 in nidmap.c
> #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct
> opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c
> #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c
> #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *)
> 0x,pargv=(char ***) 0x,flags=32) at line
> 148 in orte_init.c
> #8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **)
> 0x07fef348,requested=0,provided=(int *) 0x07fee698) at line
> 464 in ompi_mpi_init.c
> #9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *)
> 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c
> #10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **)
> 0x07fef348) at line 8 in mpiinitfinalize.c
> #11 0x00d2b81c (__libc_start_main + 0x194)
> (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0)
> #12 0x0010094c (_start + 0x2c) ()
>
> The line 252 in opal/mca/db/hash/db_hash.c is:
>
> case OPAL_UINT64:
> if (NULL == data) {
> OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM);
> return OPAL_ERR_BAD_PARAM;
> }
> kv->type = OPAL_UINT64;
> kv->data.uint64 = *(uint64_t*)(data); // !!! here !!!
> break;
>
> My environment is:
>
>   Open MPI v1.8 branch r32447 (latest)
>   configure --enable-debug
>   SPARC-V9 (Fujitsu SPARC64 IXfx)
>   Linux (custom)
>   gcc 4.2.4
>
> I could not reproduce it with Open MPI trunk nor with Fujitsu compiler.
>
> Can this information help?
>
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
>
> > Hi,
> >
> > I'm sorry once more to answer late, but the last two days our mail
> > server was down (hardware error).
> >
> > > Did you configure this --enable-debug?
> >
> > Yes, I used the following command.
> >
> > ../openmpi-1.8.2rc3/configure --prefix=/usr/local/openmpi-1.8.2_64_gcc \
> >   --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \
> >   --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
> >   --with-jdk-headers=/usr/local/jdk1.8.0/include \
> >   JAVA_HOME=/usr/local/jdk1.8.0 \
> >   LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \
> >   CC="gcc" CXX="g++" FC="gfortran" \
> >   CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \
> >   CPP="cpp" CXXCPP="cpp" \
> >   CPPFLAGS="" CXXCPPFLAGS="" \
> >   --enable-mpi-cxx \
> >   --enable-cxx-exceptions \
> >   --enable-mpi-java \
> >   --enable-heterogeneous \
> >   --enable-mpi-thread-multiple \
> >   --with-threads=posix \
> >   --with-hwloc=internal \
> >   --without-verbs \
> >   --with-wrapper-cflags="-std=c11 -m64" \
> >   --enable-debug \
> >   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc
> >
> >
> >
> > > If so, you should get a line number in the backtrace
> >
> > I got them for gdb (see below), but not for "dbx".
> >
> >
> > Kind regards
> >
> > Siegmar
> >
> >
> >
> > >
> > >
> > > On Aug 5, 2014, at 2:59 AM, Siegmar Gross
> >  wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm sorry to answer so late, but last week I didn't have Internet
> > > > access. In the meantime I've installed openmpi-1.8.2rc3 and I get
> > > > the same error.
> > > >
> > > >> This looks like the typical type of alignment error that we used
> > > >> to see when testing regularly on SPARC.  :-\
> > > >>
> > > >> It looks like the error was happening in mca_db_hash.so.  Could
> > > >> you get a stack trace / file+line number where it was failing
> > > >> in mca_db_hash?  (i.e., the actual bad code will likely be under
> > > >> opal/mca/db/hash somewhere)
> > > >
> > > Unfortunately I don't get a file+line number from a file in opal/mca/db/hash.

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Hi,

> > >>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
> > >>> 10 Sparc and I receive a bus error, if I run a small program.

I've finally reproduced the bus error in my SPARC environment.

#0 0x00db4740 (__waitpid_nocancel + 0x44) 
(0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4)
#1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct siginfo *) 
0x07fed100,p=(void *) 0x07fed100) at line 277 in ../sigattach.c 

#2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) 
0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
"opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line 252 
in db_hash.c
#3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) 
0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 
"opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line 49 
in db_base_fns.c
#4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) 
0x00281d70) at line 975 in nidmap.c
#5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct 
opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c
#6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c
#7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) 
0x,pargv=(char ***) 0x,flags=32) at line 148 in 
orte_init.c
#8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) 
0x07fef348,requested=0,provided=(int *) 0x07fee698) at line 464 
in ompi_mpi_init.c
#9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) 
0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c
#10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) 0x07fef348) 
at line 8 in mpiinitfinalize.c
#11 0x00d2b81c (__libc_start_main + 0x194) 
(0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0)
#12 0x0010094c (_start + 0x2c) ()

The line 252 in opal/mca/db/hash/db_hash.c is:

case OPAL_UINT64:
if (NULL == data) {
OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM);
return OPAL_ERR_BAD_PARAM;
}
kv->type = OPAL_UINT64;
kv->data.uint64 = *(uint64_t*)(data); // !!! here !!!
break;
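
Just as a sketch (not a committed fix), an alignment-safe variant of
that assignment would copy the bytes instead of dereferencing:

/* requires <string.h>; memcpy is legal for any source alignment */
uint64_t tmp;
memcpy(&tmp, data, sizeof(tmp));
kv->data.uint64 = tmp;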

My environment is:

  Open MPI v1.8 branch r32447 (latest)
  configure --enable-debug
  SPARC-V9 (Fujitsu SPARC64 IXfx)
  Linux (custom)
  gcc 4.2.4

I could not reproduce it with the Open MPI trunk or with the Fujitsu compiler.

Can this information help?

Takahiro Kawashima,
MPI development team,
Fujitsu

> Hi,
> 
> I'm sorry once more to answer late, but the last two days our mail
> server was down (hardware error).
> 
> > Did you configure this --enable-debug?
> 
> Yes, I used the following command.
> 
> ../openmpi-1.8.2rc3/configure --prefix=/usr/local/openmpi-1.8.2_64_gcc \
>   --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \
>   --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
>   --with-jdk-headers=/usr/local/jdk1.8.0/include \
>   JAVA_HOME=/usr/local/jdk1.8.0 \
>   LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \
>   CC="gcc" CXX="g++" FC="gfortran" \
>   CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \
>   CPP="cpp" CXXCPP="cpp" \
>   CPPFLAGS="" CXXCPPFLAGS="" \
>   --enable-mpi-cxx \
>   --enable-cxx-exceptions \
>   --enable-mpi-java \
>   --enable-heterogeneous \
>   --enable-mpi-thread-multiple \
>   --with-threads=posix \
>   --with-hwloc=internal \
>   --without-verbs \
>   --with-wrapper-cflags="-std=c11 -m64" \
>   --enable-debug \
>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc
> 
> 
> 
> > If so, you should get a line number in the backtrace
> 
> I got them for gdb (see below), but not for "dbx".
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> 
> > 
> > 
> > On Aug 5, 2014, at 2:59 AM, Siegmar Gross 
>  wrote:
> > 
> > > Hi,
> > > 
> > > I'm sorry to answer so late, but last week I didn't have Internet
> > > access. In the meantime I've installed openmpi-1.8.2rc3 and I get
> > > the same error.
> > > 
> > >> This looks like the typical type of alignment error that we used
> > >> to see when testing regularly on SPARC.  :-\
> > >> 
> > >> It looks like the error was happening in mca_db_hash.so.  Could
> > >> you get a stack trace / file+line number where it was failing
> > >> in mca_db_hash?  (i.e., the actual bad code will likely be under
> > >> opal/mca/db/hash somewhere)
> > > 
> > > Unfortunately I don't get a file+line number from a file in
> > > opal/mca/db/hash.
> > > 
> > > 
> > > 
> > > tyr small_prog 102 ompi_info | grep MPI:
> > >Open MPI: 1.8.2rc3
> > > tyr small_prog 103 which mpicc
> > > /usr/local/openmpi-1.8.2_64_gcc/bin/mpicc
> > > tyr small_prog 104 mpicc init_finalize.c 
> > > tyr small_prog 106 /opt/solstudio12.3/bin/sparcv9/dbx 
> /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec 
> > > For information about new features see `help changes'
> > > To remove this message, put `dbxenv suppress_startup_message 7.9' in your 
> .dbxrc
> > > Reading mp