Re: [OMPI devel] v1.8.2 still held up...
On Thu, Aug 7, 2014 at 10:55 AM, Ralph Castain wrote: > * fixes to coll/ml that expanded to fixing page alignment in general - > someone needs to review/approve it: > https://svn.open-mpi.org/trac/ompi/ticket/4826 > I've been able to confirm that the nightly tarball (1.8.2rc4r32480) works as expected on the SPARC and PPC64 platforms where I had reproduced the problem previously. I won't have access to the IA64 platform (which also has pagesize != 4K) until about 6 hours from now, but have no doubt the fix will work there too. -Paul -- Paul H. Hargrove phhargr...@lbl.gov Future Technologies Group Computer and Data Sciences Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
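For readers skimming the archive: the failures above all came from code that assumed a 4K page size. A minimal sketch of the portable alternative (illustration only -- the actual coll/ml fix is the one in ticket #4826 above):

    /* Illustration only: query the page size at runtime instead of
     * assuming 4K.  SPARC, PPC64 and IA64 commonly use 8K, 64K and
     * 16K pages respectively, which is why those platforms exposed
     * the bug. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long page_size = sysconf(_SC_PAGESIZE);
        printf("runtime page size: %ld bytes\n", page_size);
        return 0;
    }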
Re: [OMPI devel] v1.8.2 still held up...
On Thu, Aug 7, 2014 at 10:55 AM, Ralph Castain wrote: > * static linking failure - Gilles has posted a proposed fix, but somebody > needs to approve and CMR it. Please see: > https://svn.open-mpi.org/trac/ompi/ticket/4834 > Jeff moved the fix to v1.8 in r32471. I have tested tonight's tarball (1.8.2rc4r32480) and found the problem to be resolved on all tested OSes (linux, macos, freebsd, netbsd, openbsd, solaris-10 and solaris-11). -Paul -- Paul H. Hargrove phhargr...@lbl.gov Future Technologies Group Computer and Data Sciences Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value
I will try to take a look this week and see what I can do. -Nathan

From: devel [devel-boun...@open-mpi.org] on behalf of George Bosilca [bosi...@icl.utk.edu]
Sent: Thursday, August 07, 2014 10:37 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value

Paul's tests identified a small issue with the previous patch (a real corner case for ARM v5). The patch below fixes all known issues. Btw, there is still room for volunteers for the .asm work. George.

On Tue, Aug 5, 2014 at 2:23 PM, George Bosilca wrote: Thanks to Paul's help all the inlined atomics have been tested. The new patch is attached below. However, this only fixes the inline atomics; all those generated from the *.asm files have not been updated. Any volunteer? George.

On Aug 1, 2014, at 18:09, Paul Hargrove wrote: I have confirmed that George's latest version works on both SPARC ABIs. ARMv7 and three MIPS ABIs still pending... -Paul

On Fri, Aug 1, 2014 at 9:40 AM, George Bosilca wrote: Another version of the atomic patch. Paul has tested it on a bunch of platforms. At this point we have confirmation from all architectures except SPARC (v8+ and v9). George.

On Jul 31, 2014, at 19:13, George Bosilca wrote: > All, > > Here is the patch that changes the meaning of the atomics to make them always > return the previous value (similar to sync_fetch_and_<*>). I tested this with > the following atomics: OS X, gcc style intrinsics and AMD64. > > I did not change the base assembly files used when GCC style assembly > operations are not supported. If someone feels like fixing them, feel free. > > Paul, I know you have a pretty diverse range of computers. Can you try to > compile and run a "make check" with the following patch? > > George. > > On Jul 30, 2014, at 15:21, Nathan Hjelm wrote: > >> That is what I would prefer. I was trying to not disturb things too >> much :). Please bring the changes over! >> >> -Nathan >> >> On Wed, Jul 30, 2014 at 03:18:44PM -0400, George Bosilca wrote: >>> Why do you want to add new versions? This will lead to having two, almost >>> identical, sets of atomics that are conceptually equivalent but different >>> in terms of code. And we will have to maintain both! >>> I did a similar change in a fork of OPAL in another project but instead of >>> adding another flavor of atomics, I completely replaced the available ones >>> with a set returning the old value. I can bring the code over. >>> George. >>> >>> On Tue, Jul 29, 2014 at 5:29 PM, Paul Hargrove wrote: >>> >>> On Tue, Jul 29, 2014 at 2:10 PM, Nathan Hjelm wrote: >>> >>> Is there a reason why the >>> current implementations of opal atomics (add, cmpset) do not return >>> the >>> old value? >>> >>> Because some CPUs don't implement such an atomic instruction? >>> >>> On any CPU one *can* certainly synthesize the desired operation with an >>> added read before the compare-and-swap to return a value that was >>> present at some time before a failed cmpset. That is almost certainly >>> sufficient for your purposes. However, the added load makes it >>> (marginally) more expensive on some CPUs that only have the native >>> equivalent of gcc's __sync_bool_compare_and_swap().
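In sketch form, the synthesis Paul describes in the last paragraph looks like the following. This is an illustrative stand-in with invented names, not the actual OPAL patch, built on gcc's boolean primitive:

    /* Synthesize a compare-and-swap that returns the old value on a CPU
     * whose native primitive only reports success/failure, i.e. the
     * equivalent of gcc's __sync_bool_compare_and_swap().  The extra
     * load is the (marginal) cost Paul mentions. */
    #include <stdint.h>

    static inline int32_t
    fetch_cmpset_32(volatile int32_t *addr, int32_t oldval, int32_t newval)
    {
        int32_t prev;
        do {
            prev = *addr;            /* read before the cmpset */
            if (prev != oldval) {
                return prev;         /* mismatch: report the value we saw */
            }
            /* if the cmpset fails even though prev == oldval, the value
             * changed between the read and the cmpset: loop and re-read */
        } while (!__sync_bool_compare_and_swap(addr, oldval, newval));
        return oldval;               /* swap succeeded */
    }

Callers can then treat a return value equal to oldval as success, which matches the semantics of __sync_val_compare_and_swap() on platforms that provide it natively.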
Re: [OMPI devel] ORTE headers in OPAL source
These are harmless. They are only used when FT is enabled which should rarely be the case. George. On Fri, Aug 8, 2014 at 4:36 PM, Jeff Squyres (jsquyres) wrote: > Here's a few ORTE headers in OPAL source -- can respective owners clean > these up? Thanks. > > - > mca/btl/smcuda/btl_smcuda.c > 63:#include "orte/mca/sstore/sstore.h" > > mca/btl/sm/btl_sm.c > 62:#include "orte/mca/sstore/sstore.h" > > mca/mpool/sm/mpool_sm_module.c > 34:#include "orte/mca/sstore/sstore.h" > - > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/15570.php >
[OMPI devel] ORTE headers in OPAL source
Here's a few ORTE headers in OPAL source -- can respective owners clean these up? Thanks. - mca/btl/smcuda/btl_smcuda.c 63:#include "orte/mca/sstore/sstore.h" mca/btl/sm/btl_sm.c 62:#include "orte/mca/sstore/sstore.h" mca/mpool/sm/mpool_sm_module.c 34:#include "orte/mca/sstore/sstore.h" - -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] ompi headers in OPAL source
I found a few more OMPI header files included in OPAL source code. Can the respective owners clean this stuff up? Thanks! - mca/btl/openib/btl_openib_component.c 87:#include "ompi/mca/rte/rte.h" mca/btl/ugni/btl_ugni_component.c 20:#include "ompi/runtime/params.h" mca/btl/ugni/btl_ugni_add_procs.c 20:#include "ompi/communicator/communicator.h" mca/btl/usnic/btl_usnic_hwloc.c 33:#include "ompi/mca/rte/rte.h" mca/btl/usnic/btl_usnic_compat.h 43:# include "ompi/mca/rte/rte.h" mca/common/ofacm/common_ofacm_xoob.c 24:#include "ompi/mca/rte/rte.h" mca/common/ofacm/common_ofacm_oob.c 35:#include "ompi/mca/rte/rte.h" mca/mpool/base/mpool_base_alloc.c 32:#include "ompi/info/info.h" /* TODO */ mca/mpool/sm/mpool_sm_module.c 36:#include "ompi/runtime/ompi_cr.h" /* TODO */ - -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc
Yes, I know - but the problem comes from nidmap pushing data down into the opal_db/dstore level, which then creates a copy of the data. That's where the alignment error is generated On Aug 8, 2014, at 11:17 AM, George Bosilca wrote: > On Fri, Aug 8, 2014 at 5:21 AM, Ralph Castain wrote: > Sorry to chime in a little late. George is likely correct about using > ORTE_NAME, only you can't do that as the OPAL layer has no idea what that > datatype looks like. This was the original reason for creating the > opal_identifier_t type - I had no other choice when we moved the db framework > (now dstore) to the OPAL layer in anticipation of the BTLs moving to OPAL. > The abstraction requirement wouldn't allow me to pass down the structure > definition. > > We are talking about nidmap.c which has not yet been moved down to OPAL. > > George. > > > The easiest solution is probably to change the opal/db/hash code so that > 64-bit fields are memcpy'd instead of simply passed by "=". This should > eliminate the problem with the least fuss. > > There is a performance penalty for using non-aligned data, and ideally we > should use aligned data whenever possible. This code isn't in the critical > path and so this is less of an issue, but still would be nice to do. However, > I didn't do so for the following reasons: > > * I couldn't find a way for the compiler to check/require alignment down in > opal_db.store when passed a parameter. If someone knows of a way to do that, > please feel free to suggest it > > * none of our current developers have access to a Solaris SPARC machine, and > thus our developers cannot detect violations when they occur > > * the current solution avoids the issue, albeit with a slight performance > penalty > > I'm open to alternative methods - I'm not happy with the ugliness this > required, but couldn't come up with a cleaner solution that would be easy for > developers to know when they violated the alignment requirement. > > FWIW: it is possible, I suppose, that the other discussion about using an > opal_process_name_t that exactly mirrors orte_process_name_t could also > resolve this problem in a cleaner fashion. I didn't impose that requirement > here, but maybe it's another motivator for doing so? > > Ralph > > > On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet > wrote: > >> George, >> >> (one of the) faulty line was : >> >>if (ORTE_SUCCESS != (rc = >> opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL, >> >> OPAL_DB_LOCALLDR, (opal_identifier_t*)&proc, OPAL_ID_T))) { >> >> so if proc is not 64 bits aligned, a SIGBUS will occur on sparc. >> as you pointed, replacing OPAL_ID_T with ORTE_NAME will very likely fix the >> issue (i have no arch to test...) >> >> i was initially also "confused" with the following line >> >> if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, >> OPAL_SCOPE_INTERNAL, >> ORTE_DB_NPROC_OFFSET, >> &offset, OPAL_UINT32))) { >> >> the first argument of store is an (opal_identifier_t *) >> strictly speaking this is "a pointer to a 64 bits aligned address", and proc >> might not be 64 bits aligned. >> /* that being said, there is no crash :-) */ >> >> in this case, opal_db.store pointer points to the store function >> (db_hash.c:178) >> and proc is only used id memcpy at line 194, so 64 bits alignment is not >> required. >> (and comment is explicit : /* to protect alignment, copy the data across */ >> >> that might sounds pedantic, but are we doing the right thing here ? >> (e.g. 
cast to (opal_identifier_t *), followed by a memcpy in case the >> pointer was not 64 bits aligned >> vs always use aligned data ?) >> >> Cheers, >> >> Gilles >> >> On 2014/08/08 14:58, George Bosilca wrote: >>> This is a gigantic patch for an almost trivial issue. The current problem >>> is purely related to the fact that in a single location (nidmap.c) the >>> orte_process_name_t (which is a structure of 2 integers) is supposed to be >>> aligned based on the uint64_t requirements. Bad assumption! >>> >>> Looking at the code one might notice that the orte_process_name_t is stored >>> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold >>> on the SPARC architecture because the two types (int32_t and int64_t) have >>> different alignments. However, ORTE defines a type for orte_process_name_t. >>> Thus, I think that if instead of saving the orte_process_name_t as an >>> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away. >>> >>> George. >>> >>> >>> >>> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet < >>> gilles.gouaillar...@iferc.org> wrote: >>>> Kawashima-san and all, >>>> >>>> Here is attached a one-off patch for v1.8. >>>> /* it does not use the __attribute__ modifier that might not be >>>> supported by all compilers */
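The "copy the data across" pattern debated in this thread reduces to the following sketch (illustrative names, not the actual db_hash.c code). The point is that the callee never dereferences the possibly-unaligned pointer directly, since a 64-bit load from a merely 4-byte-aligned address is what raises the SIGBUS on SPARC:

    #include <stdint.h>
    #include <string.h>

    typedef uint64_t opal_identifier_t;

    /* 'uid' may point at a 4-byte-aligned orte_process_name_t that the
     * caller cast to (opal_identifier_t *) */
    static uint64_t protected_load(const opal_identifier_t *uid)
    {
        opal_identifier_t id;
        /* to protect alignment, copy the data across */
        memcpy(&id, uid, sizeof(opal_identifier_t));
        return id;   /* now safe to use as an ordinary uint64_t */
    }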
Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC
I will attempt to confirm on my Solaris-10 system ASAP. That will allow me to finally be certain that the other static linking issue has been resolved. -Paul On Fri, Aug 8, 2014 at 11:39 AM, Jeff Squyres (jsquyres) wrote: > Thanks! > > On Aug 8, 2014, at 2:30 PM, George Bosilca wrote: > > > r32467 should fix the problem. > > > > George. > > > > > > On Fri, Aug 8, 2014 at 1:20 PM, Jeff Squyres (jsquyres) < > jsquy...@cisco.com> wrote: > > That'll do it... > > > > George: can you fix? > > > > > > On Aug 8, 2014, at 1:11 PM, Ralph Castain wrote: > > > > > I think it might be getting pulled in from this include: > > > > > > opal/mca/common/sm/common_sm.h:37:#include "ompi/group/group.h" > > > > > > > > > On Aug 8, 2014, at 5:33 AM, Jeff Squyres (jsquyres) < > jsquy...@cisco.com> wrote: > > > > > >> Weirdness; I don't see any name like that in the SM BTL. > > >> > > >> I see it used in the OMPI layer... not sure how it's being using down > in the btl SM component file...? > > >> > > >> > > >> On Aug 7, 2014, at 11:25 PM, Paul Hargrove > wrote: > > >> > > >>> Testing r32448 on trunk for trac issue #4834, I encounter the > following which appears unrelated to #4834: > > >>> > > >>> CCLD orte-info > > >>> Undefined first referenced > > >>> symbol in file > > >>> ompi_proc_local_proc > > /sandbox/hargrove/OMPI/openmpi-trunk-solaris10-sparcT2-ss12u3-v9-static/BLD/opal/.libs/libopen-pal.a(libmca_btl_sm_la-btl_sm_component.o) > > >>> ld: fatal: Symbol referencing errors. No output written to orte-info > > >>> > > >>> Note that this is *static* linking. > > >>> > > >>> This appears to indicate a call from OPAL to OMPI, and I am guessing > this is a side-effect of the BTL move. > > >>> > > >>> Since OMPI contains (many) calls to OPAL this is a circular library > dependence. > > >>> Unfortunately, some linkers process their argument strictly > left-to-right. > > >>> Thus if this dependence is not eliminated one may need "-lmpi > -lopen-pal -lmpi" (or similar) to resolve it. > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/15565.php > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/15566.php > -- Paul H. Hargrove phhargr...@lbl.gov Future Technologies Group Computer and Data Sciences Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC
Thanks! On Aug 8, 2014, at 2:30 PM, George Bosilca wrote: > r32467 should fix the problem. > > George. > > > On Fri, Aug 8, 2014 at 1:20 PM, Jeff Squyres (jsquyres) > wrote: > That'll do it... > > George: can you fix? > > > On Aug 8, 2014, at 1:11 PM, Ralph Castain wrote: > > > I think it might be getting pulled in from this include: > > > > opal/mca/common/sm/common_sm.h:37:#include "ompi/group/group.h" > > > > > > On Aug 8, 2014, at 5:33 AM, Jeff Squyres (jsquyres) > > wrote: > > > >> Weirdness; I don't see any name like that in the SM BTL. > >> > >> I see it used in the OMPI layer... not sure how it's being using down in > >> the btl SM component file...? > >> > >> > >> On Aug 7, 2014, at 11:25 PM, Paul Hargrove wrote: > >> > >>> Testing r32448 on trunk for trac issue #4834, I encounter the following > >>> which appears unrelated to #4834: > >>> > >>> CCLD orte-info > >>> Undefined first referenced > >>> symbol in file > >>> ompi_proc_local_proc > >>> /sandbox/hargrove/OMPI/openmpi-trunk-solaris10-sparcT2-ss12u3-v9-static/BLD/opal/.libs/libopen-pal.a(libmca_btl_sm_la-btl_sm_component.o) > >>> ld: fatal: Symbol referencing errors. No output written to orte-info > >>> > >>> Note that this is *static* linking. > >>> > >>> This appears to indicate a call from OPAL to OMPI, and I am guessing this > >>> is a side-effect of the BTL move. > >>> > >>> Since OMPI contains (many) calls to OPAL this is a circular library > >>> dependence. > >>> Unfortunately, some linkers process their argument strictly left-to-right. > >>> Thus if this dependence is not eliminated one may need "-lmpi -lopen-pal > >>> -lmpi" (or similar) to resolve it. > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/15565.php -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC
r32467 should fix the problem. George. On Fri, Aug 8, 2014 at 1:20 PM, Jeff Squyres (jsquyres) wrote: > That'll do it... > > George: can you fix? > > > On Aug 8, 2014, at 1:11 PM, Ralph Castain wrote: > > > I think it might be getting pulled in from this include: > > > > opal/mca/common/sm/common_sm.h:37:#include "ompi/group/group.h" > > > > > > On Aug 8, 2014, at 5:33 AM, Jeff Squyres (jsquyres) > wrote: > > > >> Weirdness; I don't see any name like that in the SM BTL. > >> > >> I see it used in the OMPI layer... not sure how it's being using down > in the btl SM component file...? > >> > >> > >> On Aug 7, 2014, at 11:25 PM, Paul Hargrove wrote: > >> > >>> Testing r32448 on trunk for trac issue #4834, I encounter the > following which appears unrelated to #4834: > >>> > >>> CCLD orte-info > >>> Undefined first referenced > >>> symbol in file > >>> ompi_proc_local_proc > > /sandbox/hargrove/OMPI/openmpi-trunk-solaris10-sparcT2-ss12u3-v9-static/BLD/opal/.libs/libopen-pal.a(libmca_btl_sm_la-btl_sm_component.o) > >>> ld: fatal: Symbol referencing errors. No output written to orte-info > >>> > >>> Note that this is *static* linking. > >>> > >>> This appears to indicate a call from OPAL to OMPI, and I am guessing > this is a side-effect of the BTL move. > >>> > >>> Since OMPI contains (many) calls to OPAL this is a circular library > dependence. > >>> Unfortunately, some linkers process their argument strictly > left-to-right. > >>> Thus if this dependence is not eliminated one may need "-lmpi > -lopen-pal -lmpi" (or similar) to resolve it. >
Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc
On Fri, Aug 8, 2014 at 5:21 AM, Ralph Castain wrote: > Sorry to chime in a little late. George is likely correct about using > ORTE_NAME, only you can't do that as the OPAL layer has no idea what that > datatype looks like. This was the original reason for creating the > opal_identifier_t type - I had no other choice when we moved the db > framework (now dstore) to the OPAL layer in anticipation of the BTLs moving > to OPAL. The abstraction requirement wouldn't allow me to pass down the > structure definition. > We are talking about nidmap.c which has not yet been moved down to OPAL. George. > > The easiest solution is probably to change the opal/db/hash code so that > 64-bit fields are memcpy'd instead of simply passed by "=". This should > eliminate the problem with the least fuss. > > There is a performance penalty for using non-aligned data, and ideally we > should use aligned data whenever possible. This code isn't in the critical > path and so this is less of an issue, but still would be nice to do. > However, I didn't do so for the following reasons: > > * I couldn't find a way for the compiler to check/require alignment down > in opal_db.store when passed a parameter. If someone knows of a way to do > that, please feel free to suggest it > > * none of our current developers have access to a Solaris SPARC machine, > and thus our developers cannot detect violations when they occur > > * the current solution avoids the issue, albeit with a slight performance > penalty > > I'm open to alternative methods - I'm not happy with the ugliness this > required, but couldn't come up with a cleaner solution that would be easy > for developers to know when they violated the alignment requirement. > > FWIW: it is possible, I suppose, that the other discussion about using an > opal_process_name_t that exactly mirrors orte_process_name_t could also > resolve this problem in a cleaner fashion. I didn't impose that requirement > here, but maybe it's another motivator for doing so? > > Ralph > > > On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet < > gilles.gouaillar...@iferc.org> wrote: > > George, > > (one of the) faulty line was : > >if (ORTE_SUCCESS != (rc = > opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL, > > OPAL_DB_LOCALLDR, (opal_identifier_t*)&proc, OPAL_ID_T))) { > > so if proc is not 64 bits aligned, a SIGBUS will occur on sparc. > as you pointed, replacing OPAL_ID_T with ORTE_NAME will very likely fix > the issue (i have no arch to test...) > > i was initially also "confused" with the following line > > if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, > OPAL_SCOPE_INTERNAL, > ORTE_DB_NPROC_OFFSET, > &offset, OPAL_UINT32))) { > > the first argument of store is an (opal_identifier_t *) > strictly speaking this is "a pointer to a 64 bits aligned address", and > proc might not be 64 bits aligned. > /* that being said, there is no crash :-) */ > > in this case, opal_db.store pointer points to the store function > (db_hash.c:178) > and proc is only used id memcpy at line 194, so 64 bits alignment is not > required. > (and comment is explicit : /* to protect alignment, copy the data across > */ > > that might sounds pedantic, but are we doing the right thing here ? > (e.g. cast to (opal_identifier_t *), followed by a memcpy in case the > pointer was not 64 bits aligned > vs always use aligned data ?) > > Cheers, > > Gilles > > On 2014/08/08 14:58, George Bosilca wrote: > > This is a gigantic patch for an almost trivial issue. 
The current problem > is purely related to the fact that in a single location (nidmap.c) the > orte_process_name_t (which is a structure of 2 integers) is supposed to be > aligned based on the uint64_t requirements. Bad assumption! > > Looking at the code one might notice that the orte_process_name_t is stored > using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold > on the SPARC architecture because the two types (int32_t and int64_t) have > different alignments. However, ORTE define a type for orte_process_name_t. > Thus, I think that if instead of saving the orte_process_name_t as an > OPAL_ID_T, we save it as an ORTE_NAME the issue will go away. > > George. > > > > On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet > wrote: > > > Kawashima-san and all, > > Here is attached a one off patch for v1.8. > /* it does not use the __attribute__ modifier that might not be > supported by all compilers */ > > as far as i am concerned, the same issue is also in the trunk, > and if you do not hit it, it just means you are lucky :-) > > the same issue might also be in other parts of the code :-( > > Cheers, > > Gilles > > On 2014/08/08 13:45, Kawashima, Takahiro wrote: > > Gilles, George, > > The problem is the one Gilles pointed. > I temporarily modified the code bellow and the bus error disappeared. > > --- orte/util/nidmap.c (revision 32447
Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC
That'll do it... George: can you fix? On Aug 8, 2014, at 1:11 PM, Ralph Castain wrote: > I think it might be getting pulled in from this include: > > opal/mca/common/sm/common_sm.h:37:#include "ompi/group/group.h" > > > On Aug 8, 2014, at 5:33 AM, Jeff Squyres (jsquyres) > wrote: > >> Weirdness; I don't see any name like that in the SM BTL. >> >> I see it used in the OMPI layer... not sure how it's being using down in the >> btl SM component file...? >> >> >> On Aug 7, 2014, at 11:25 PM, Paul Hargrove wrote: >> >>> Testing r32448 on trunk for trac issue #4834, I encounter the following >>> which appears unrelated to #4834: >>> >>> CCLD orte-info >>> Undefined first referenced >>> symbol in file >>> ompi_proc_local_proc >>> /sandbox/hargrove/OMPI/openmpi-trunk-solaris10-sparcT2-ss12u3-v9-static/BLD/opal/.libs/libopen-pal.a(libmca_btl_sm_la-btl_sm_component.o) >>> ld: fatal: Symbol referencing errors. No output written to orte-info >>> >>> Note that this is *static* linking. >>> >>> This appears to indicate a call from OPAL to OMPI, and I am guessing this >>> is a side-effect of the BTL move. >>> >>> Since OMPI contains (many) calls to OPAL this is a circular library >>> dependence. >>> Unfortunately, some linkers process their argument strictly left-to-right. >>> Thus if this dependence is not eliminated one may need "-lmpi -lopen-pal >>> -lmpi" (or similar) to resolve it. >>> >>> -Paul >>> >>> -- >>> Paul H. Hargrove phhargr...@lbl.gov >>> Future Technologies Group >>> Computer and Data Sciences Department Tel: +1-510-495-2352 >>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900 >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> Link to this post: >>> http://www.open-mpi.org/community/lists/devel/2014/08/15540.php >> >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> ___ >> devel mailing list >> de...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >> Link to this post: >> http://www.open-mpi.org/community/lists/devel/2014/08/15553.php > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/15562.php -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC
I think it might be getting pulled in from this include: opal/mca/common/sm/common_sm.h:37:#include "ompi/group/group.h" On Aug 8, 2014, at 5:33 AM, Jeff Squyres (jsquyres) wrote: > Weirdness; I don't see any name like that in the SM BTL. > > I see it used in the OMPI layer... not sure how it's being using down in the > btl SM component file...? > > > On Aug 7, 2014, at 11:25 PM, Paul Hargrove wrote: > >> Testing r32448 on trunk for trac issue #4834, I encounter the following >> which appears unrelated to #4834: >> >> CCLD orte-info >> Undefined first referenced >> symbol in file >> ompi_proc_local_proc >> /sandbox/hargrove/OMPI/openmpi-trunk-solaris10-sparcT2-ss12u3-v9-static/BLD/opal/.libs/libopen-pal.a(libmca_btl_sm_la-btl_sm_component.o) >> ld: fatal: Symbol referencing errors. No output written to orte-info >> >> Note that this is *static* linking. >> >> This appears to indicate a call from OPAL to OMPI, and I am guessing this is >> a side-effect of the BTL move. >> >> Since OMPI contains (many) calls to OPAL this is a circular library >> dependence. >> Unfortunately, some linkers process their argument strictly left-to-right. >> Thus if this dependence is not eliminated one may need "-lmpi -lopen-pal >> -lmpi" (or similar) to resolve it. >> >> -Paul >> >> -- >> Paul H. Hargrove phhargr...@lbl.gov >> Future Technologies Group >> Computer and Data Sciences Department Tel: +1-510-495-2352 >> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900 >> ___ >> devel mailing list >> de...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >> Link to this post: >> http://www.open-mpi.org/community/lists/devel/2014/08/15540.php > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/15553.php
Re: [OMPI devel] Open MPI SVN -> Git (github) conversion
Done; thanks. On Aug 8, 2014, at 11:05 AM, Tim Mattox wrote: > Jeff, > I may someday again be working for an organization that is an Open MPI > contributor... so could you > update my e-mail address in the authors.txt file to be "timattox = Tim Mattox > " > Thanks! > > > On Fri, Aug 8, 2014 at 11:00 AM, Jeff Squyres (jsquyres) > wrote: > SHORT VERSION > = > > Please verify/update the email address that you'd like me to use for your > Open MPI commits when we do the git conversion: > > https://github.com/open-mpi/authors > > Updates are due by COB Friday, 15 Aug, 2014 (1 week from today). > > MORE DETAIL > === > > Dave and I are continuing to work on the logistics of the SVN -> Git > conversion. > > As part of the process, I need email addresses for which you'd like your > commits to appear in the git repo. Please see this git repo for the current > list of email addresses that I have, as well as instructions for how to > update them: > > https://github.com/open-mpi/authors > > Thanks! > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/1.php > > > > -- > Tim Mattox, Ph.D. - tmat...@gmail.com > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/15556.php -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] ibm abort test hangs on one node
Committed a fix for this in r32460 - see if I got it! On Aug 8, 2014, at 4:02 AM, Gilles Gouaillardet wrote: > Folks, > > here is the description of a hang i briefly mentionned a few days ago. > > with the trunk (i did not check 1.8 ...) simply run on one node : > mpirun -np 2 --mca btl sm,self ./abort > > (the abort test is taken from the ibm test suite : process 0 call > MPI_Abort while process 1 enters an infinite loop) > > there is a race condition : sometimes it hangs, sometimes it aborts > nicely as expected. > when the hang occurs, both abort processes have exited and mpirun waits > forever > > i made some investigations and i have now a better idea of what happens > (but i am still clueless on how to fix this) > > when process 0 abort, it : > - closes the tcp socket connected to mpirun > - closes the pipe connected to mpirun > - send SIGCHLD to mpirun > > then on mpirun : > when SIGCHLD is received, the handler basically writes 17 (the signal > number) to a socketpair. > then libevent will return from a poll and here is the race condition, > basically : > if revents is non zero for the three fds (socket, pipe and socketpair) > then the program will abort nicely > if revents is non zero for both socket and pipe but is zero for the > socketpair, then the mpirun will hang > > i digged a bit deeper and found that when the event on the socketpair is > processed, it will end up calling > odls_base_default_wait_local_proc. > if proc->state is 5 (aka ORTE_PROC_STATE_REGISTERED), then the program > will abort nicely > *but* if proc->state is 6 (aka ORTE_PROC_STATE_IOF_COMPLETE), then the > program will hang > > an other way to put this is that > when the program aborts nicely, the call sequence is > odls_base_default_wait_local_proc > proc_errors(vpid=0) > proc_errors(vpid=0) > proc_errors(vpid=1) > proc_errors(vpid=1) > > when the program hangs, the call sequence is > proc_errors(vpid=0) > odls_base_default_wait_local_proc > proc_errors(vpid=0) > proc_errors(vpid=1) > proc_errors(vpid=1) > > i will resume this on Monday unless someone can fix this in the mean > time :-) > > Cheers, > > Gilles > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/15552.php
Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc
Committed a fix for this in r32459 - please check and see if this resolves the issue. On Aug 8, 2014, at 2:21 AM, Ralph Castain wrote: > Sorry to chime in a little late. George is likely correct about using > ORTE_NAME, only you can't do that as the OPAL layer has no idea what that > datatype looks like. This was the original reason for creating the > opal_identifier_t type - I had no other choice when we moved the db framework > (now dstore) to the OPAL layer in anticipation of the BTLs moving to OPAL. > The abstraction requirement wouldn't allow me to pass down the structure > definition. > > The easiest solution is probably to change the opal/db/hash code so that > 64-bit fields are memcpy'd instead of simply passed by "=". This should > eliminate the problem with the least fuss. > > There is a performance penalty for using non-aligned data, and ideally we > should use aligned data whenever possible. This code isn't in the critical > path and so this is less of an issue, but still would be nice to do. However, > I didn't do so for the following reasons: > > * I couldn't find a way for the compiler to check/require alignment down in > opal_db.store when passed a parameter. If someone knows of a way to do that, > please feel free to suggest it > > * none of our current developers have access to a Solaris SPARC machine, and > thus our developers cannot detect violations when they occur > > * the current solution avoids the issue, albeit with a slight performance > penalty > > I'm open to alternative methods - I'm not happy with the ugliness this > required, but couldn't come up with a cleaner solution that would be easy for > developers to know when they violated the alignment requirement. > > FWIW: it is possible, I suppose, that the other discussion about using an > opal_process_name_t that exactly mirrors orte_process_name_t could also > resolve this problem in a cleaner fashion. I didn't impose that requirement > here, but maybe it's another motivator for doing so? > > Ralph > > > On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet > wrote: > >> George, >> >> (one of the) faulty line was : >> >>if (ORTE_SUCCESS != (rc = >> opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL, >> >> OPAL_DB_LOCALLDR, (opal_identifier_t*)&proc, OPAL_ID_T))) { >> >> so if proc is not 64 bits aligned, a SIGBUS will occur on sparc. >> as you pointed, replacing OPAL_ID_T with ORTE_NAME will very likely fix the >> issue (i have no arch to test...) >> >> i was initially also "confused" with the following line >> >> if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, >> OPAL_SCOPE_INTERNAL, >> ORTE_DB_NPROC_OFFSET, >> &offset, OPAL_UINT32))) { >> >> the first argument of store is an (opal_identifier_t *) >> strictly speaking this is "a pointer to a 64 bits aligned address", and proc >> might not be 64 bits aligned. >> /* that being said, there is no crash :-) */ >> >> in this case, opal_db.store pointer points to the store function >> (db_hash.c:178) >> and proc is only used id memcpy at line 194, so 64 bits alignment is not >> required. >> (and comment is explicit : /* to protect alignment, copy the data across */ >> >> that might sounds pedantic, but are we doing the right thing here ? >> (e.g. cast to (opal_identifier_t *), followed by a memcpy in case the >> pointer was not 64 bits aligned >> vs always use aligned data ?) >> >> Cheers, >> >> Gilles >> >> On 2014/08/08 14:58, George Bosilca wrote: >>> This is a gigantic patch for an almost trivial issue. 
The current problem >>> is purely related to the fact that in a single location (nidmap.c) the >>> orte_process_name_t (which is a structure of 2 integers) is supposed to be >>> aligned based on the uint64_t requirements. Bad assumption! >>> >>> Looking at the code one might notice that the orte_process_name_t is stored >>> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold >>> on the SPARC architecture because the two types (int32_t and int64_t) have >>> different alignments. However, ORTE define a type for orte_process_name_t. >>> Thus, I think that if instead of saving the orte_process_name_t as an >>> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away. >>> >>> George. >>> >>> >>> >>> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet < >>> gilles.gouaillar...@iferc.org> wrote: >>> Kawashima-san and all, Here is attached a one off patch for v1.8. /* it does not use the __attribute__ modifier that might not be supported by all compilers */ as far as i am concerned, the same issue is also in the trunk, and if you do not hit it, it just means you are lucky :-) the same issue might also be in other parts of the code :-( Cheers, Gilles
Re: [OMPI devel] jenkins error in trunk
Fixed in r32462 On Aug 8, 2014, at 8:13 AM, Mike Dubman wrote: > > Josh,Devendar - could you please take a look? > Thanks > > 15:45:00 Making install in mca/coll/fca > 15:45:00 make[2]: Entering directory > `/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi/mca/coll/fca' > 15:45:00 CC coll_fca_module.lo > 15:45:00 coll_fca_module.c: In function 'have_remote_peers': > 15:45:00 coll_fca_module.c:48: error: 'ompi_proc_t' has no member named > 'proc_flags' > 15:45:00 coll_fca_module.c:48: error: 'ompi_proc_t' has no member named > 'proc_flags' > 15:45:00 coll_fca_module.c: In function '__get_local_ranks': > 15:45:00 coll_fca_module.c:75: error: 'ompi_proc_t' has no member named > 'proc_flags' > 15:45:00 coll_fca_module.c:75: error: 'ompi_proc_t' has no member named > 'proc_flags' > 15:45:00 coll_fca_module.c:95: error: 'ompi_proc_t' has no member named > 'proc_flags' > 15:45:00 coll_fca_module.c:95: error: 'ompi_proc_t' has no member named > 'proc_flags' > 15:45:00 make[2]: *** [coll_fca_module.lo] Error 1 > 15:45:00 make[2]: Leaving directory > `/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi/mca/coll/fca' > 15:45:00 make[1]: *** [install-recursive] Error 1 > 15:45:00 make[1]: Leaving directory > `/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi' > 15:45:00 make: *** [install-recursive] Error 1 > 15:45:00 Build step 'Execute shell' marked build as failu > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/15557.php
[OMPI devel] jenkins error in trunk
Josh, Devendar - could you please take a look? Thanks

15:45:00 Making install in mca/coll/fca
15:45:00 make[2]: Entering directory `/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi/mca/coll/fca'
15:45:00 CC coll_fca_module.lo
15:45:00 coll_fca_module.c: In function 'have_remote_peers':
15:45:00 coll_fca_module.c:48: error: 'ompi_proc_t' has no member named 'proc_flags'
15:45:00 coll_fca_module.c:48: error: 'ompi_proc_t' has no member named 'proc_flags'
15:45:00 coll_fca_module.c: In function '__get_local_ranks':
15:45:00 coll_fca_module.c:75: error: 'ompi_proc_t' has no member named 'proc_flags'
15:45:00 coll_fca_module.c:75: error: 'ompi_proc_t' has no member named 'proc_flags'
15:45:00 coll_fca_module.c:95: error: 'ompi_proc_t' has no member named 'proc_flags'
15:45:00 coll_fca_module.c:95: error: 'ompi_proc_t' has no member named 'proc_flags'
15:45:00 make[2]: *** [coll_fca_module.lo] Error 1
15:45:00 make[2]: Leaving directory `/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi/mca/coll/fca'
15:45:00 make[1]: *** [install-recursive] Error 1
15:45:00 make[1]: Leaving directory `/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi'
15:45:00 make: *** [install-recursive] Error 1
15:45:00 Build step 'Execute shell' marked build as failure
Re: [OMPI devel] Open MPI SVN -> Git (github) conversion
Jeff, I may someday again be working for an organization that is an Open MPI contributor... so could you update my e-mail address in the authors.txt file to be "timattox = Tim Mattox " Thanks! On Fri, Aug 8, 2014 at 11:00 AM, Jeff Squyres (jsquyres) wrote: > SHORT VERSION > = > > Please verify/update the email address that you'd like me to use for your > Open MPI commits when we do the git conversion: > > https://github.com/open-mpi/authors > > Updates are due by COB Friday, 15 Aug, 2014 (1 week from today). > > MORE DETAIL > === > > Dave and I are continuing to work on the logistics of the SVN -> Git > conversion. > > As part of the process, I need email addresses for which you'd like your > commits to appear in the git repo. Please see this git repo for the > current list of email addresses that I have, as well as instructions for > how to update them: > > https://github.com/open-mpi/authors > > Thanks! > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/1.php > -- Tim Mattox, Ph.D. - tmat...@gmail.com
[OMPI devel] Open MPI SVN -> Git (github) conversion
SHORT VERSION = Please verify/update the email address that you'd like me to use for your Open MPI commits when we do the git conversion: https://github.com/open-mpi/authors Updates are due by COB Friday, 15 Aug, 2014 (1 week from today). MORE DETAIL === Dave and I are continuing to work on the logistics of the SVN -> Git conversion. As part of the process, I need email addresses for which you'd like your commits to appear in the git repo. Please see this git repo for the current list of email addresses that I have, as well as instructions for how to update them: https://github.com/open-mpi/authors Thanks! -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] errors and warnings with show_help() usage
SHORT VERSION = The ./contrib/check-help-strings.pl script is showing ***47 coding errors*** with regards to using show_help() in components. Here's a summary of the offenders: - ORTE (lumped together because there's a single maintainer :-) ) - smcuda and cuda - common/verbs - bcol - mxm - openib - oshmem Could the owners of these portions of the code base please run ./contrib/check-help-strings.pl and fix the ERRORs that are shown? Thanks! MORE DETAIL === The first part of ./contrib/check-help-strings.pl's output shows ERRORs -- referring to help files that do not exist, or referring to help topics that do not exist. I'm only calling out the ERRORs in this mail -- but the second part of the output shows a bazillion WARNINGs, too. These are help topics that are probably unused -- they don't seem to be referenced by the code anywhere. It would be good to clean up all the WARNINGs, too, but the ERRORs are more worrisome. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
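For anyone fixing their component, the convention the script checks is sketched below; the component, file, and topic names are invented for illustration. The two mismatches reported as ERRORs are a filename that does not match any shipped help-*.txt file and a topic that does not match any [section] inside it:

    #include "opal/util/show_help.h"

    static void warn_queue_full(const char *queue_name, int depth)
    {
        /* assumes the component ships a help-mca-foo-bar.txt containing
         * a "[queue full]" section with %s / %d substitutions */
        opal_show_help("help-mca-foo-bar.txt", "queue full",
                       true,              /* prepend the error header */
                       queue_name, depth);
    }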
Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC
Weirdness; I don't see any name like that in the SM BTL.

I see it used in the OMPI layer... not sure how it's being used down in the btl SM component file...?

On Aug 7, 2014, at 11:25 PM, Paul Hargrove wrote:

> Testing r32448 on trunk for trac issue #4834, I encounter the following which appears unrelated to #4834:
>
> CCLD orte-info
> Undefined first referenced
> symbol in file
> ompi_proc_local_proc /sandbox/hargrove/OMPI/openmpi-trunk-solaris10-sparcT2-ss12u3-v9-static/BLD/opal/.libs/libopen-pal.a(libmca_btl_sm_la-btl_sm_component.o)
> ld: fatal: Symbol referencing errors. No output written to orte-info
>
> Note that this is *static* linking.
>
> This appears to indicate a call from OPAL to OMPI, and I am guessing this is a side-effect of the BTL move.
>
> Since OMPI contains (many) calls to OPAL this is a circular library dependence.
> Unfortunately, some linkers process their arguments strictly left-to-right.
> Thus if this dependence is not eliminated one may need "-lmpi -lopen-pal -lmpi" (or similar) to resolve it.
>
> -Paul
>
> --
> Paul H. Hargrove phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15540.php

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] ibm abort test hangs on one node
Folks,

here is the description of a hang I briefly mentioned a few days ago.

with the trunk (I did not check 1.8 ...) simply run on one node :
mpirun -np 2 --mca btl sm,self ./abort

(the abort test is taken from the ibm test suite : process 0 calls MPI_Abort while process 1 enters an infinite loop)

there is a race condition : sometimes it hangs, sometimes it aborts nicely as expected. when the hang occurs, both abort processes have exited and mpirun waits forever

I made some investigations and I have now a better idea of what happens (but I am still clueless on how to fix this)

when process 0 aborts, it :
- closes the tcp socket connected to mpirun
- closes the pipe connected to mpirun
- sends SIGCHLD to mpirun

then on mpirun : when SIGCHLD is received, the handler basically writes 17 (the signal number) to a socketpair. then libevent will return from a poll and here is the race condition, basically :
if revents is non-zero for the three fds (socket, pipe and socketpair) then the program will abort nicely
if revents is non-zero for both socket and pipe but is zero for the socketpair, then mpirun will hang

I dug a bit deeper and found that when the event on the socketpair is processed, it will end up calling odls_base_default_wait_local_proc. if proc->state is 5 (aka ORTE_PROC_STATE_REGISTERED), then the program will abort nicely *but* if proc->state is 6 (aka ORTE_PROC_STATE_IOF_COMPLETE), then the program will hang

another way to put this is that when the program aborts nicely, the call sequence is
odls_base_default_wait_local_proc
proc_errors(vpid=0)
proc_errors(vpid=0)
proc_errors(vpid=1)
proc_errors(vpid=1)

when the program hangs, the call sequence is
proc_errors(vpid=0)
odls_base_default_wait_local_proc
proc_errors(vpid=0)
proc_errors(vpid=1)
proc_errors(vpid=1)

I will resume this on Monday unless someone can fix this in the meantime :-)

Cheers,

Gilles
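For reference, the handler-to-socketpair hand-off Gilles describes is the classic self-pipe pattern. A minimal sketch, with illustrative names rather than ORTE's actual code:

    #include <signal.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int signal_fds[2];   /* [1]: written by the handler,
                                   [0]: polled by the event loop */

    static void sigchld_handler(int signo)
    {
        unsigned char byte = (unsigned char) signo;   /* 17 == SIGCHLD */
        /* write() is async-signal-safe; the real work is deferred to
         * the event loop, which sees signal_fds[0] become readable */
        (void) write(signal_fds[1], &byte, 1);
    }

    static int setup_signal_forwarding(void)
    {
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, signal_fds) < 0) {
            return -1;
        }
        signal(SIGCHLD, sigchld_handler);
        return 0;
    }

The race described above is then between this descriptor and the child's socket/pipe descriptors becoming readable in the same poll iteration: depending on the order, the waitpid-driven cleanup (odls_base_default_wait_local_proc) observes a different proc->state, and only one of the two orders shuts mpirun down.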
Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc
Sorry to chime in a little late. George is likely correct about using ORTE_NAME, only you can't do that as the OPAL layer has no idea what that datatype looks like. This was the original reason for creating the opal_identifier_t type - I had no other choice when we moved the db framework (now dstore) to the OPAL layer in anticipation of the BTLs moving to OPAL. The abstraction requirement wouldn't allow me to pass down the structure definition. The easiest solution is probably to change the opal/db/hash code so that 64-bit fields are memcpy'd instead of simply passed by "=". This should eliminate the problem with the least fuss. There is a performance penalty for using non-aligned data, and ideally we should use aligned data whenever possible. This code isn't in the critical path and so this is less of an issue, but still would be nice to do. However, I didn't do so for the following reasons: * I couldn't find a way for the compiler to check/require alignment down in opal_db.store when passed a parameter. If someone knows of a way to do that, please feel free to suggest it * none of our current developers have access to a Solaris SPARC machine, and thus our developers cannot detect violations when they occur * the current solution avoids the issue, albeit with a slight performance penalty I'm open to alternative methods - I'm not happy with the ugliness this required, but couldn't come up with a cleaner solution that would be easy for developers to know when they violated the alignment requirement. FWIW: it is possible, I suppose, that the other discussion about using an opal_process_name_t that exactly mirrors orte_process_name_t could also resolve this problem in a cleaner fashion. I didn't impose that requirement here, but maybe it's another motivator for doing so? Ralph On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet wrote: > George, > > (one of the) faulty line was : > >if (ORTE_SUCCESS != (rc = > opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL, > OPAL_DB_LOCALLDR, > (opal_identifier_t*)&proc, OPAL_ID_T))) { > > so if proc is not 64 bits aligned, a SIGBUS will occur on sparc. > as you pointed, replacing OPAL_ID_T with ORTE_NAME will very likely fix the > issue (i have no arch to test...) > > i was initially also "confused" with the following line > > if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, > OPAL_SCOPE_INTERNAL, > ORTE_DB_NPROC_OFFSET, > &offset, OPAL_UINT32))) { > > the first argument of store is an (opal_identifier_t *) > strictly speaking this is "a pointer to a 64 bits aligned address", and proc > might not be 64 bits aligned. > /* that being said, there is no crash :-) */ > > in this case, opal_db.store pointer points to the store function > (db_hash.c:178) > and proc is only used id memcpy at line 194, so 64 bits alignment is not > required. > (and comment is explicit : /* to protect alignment, copy the data across */ > > that might sounds pedantic, but are we doing the right thing here ? > (e.g. cast to (opal_identifier_t *), followed by a memcpy in case the > pointer was not 64 bits aligned > vs always use aligned data ?) > > Cheers, > > Gilles > > On 2014/08/08 14:58, George Bosilca wrote: >> This is a gigantic patch for an almost trivial issue. The current problem >> is purely related to the fact that in a single location (nidmap.c) the >> orte_process_name_t (which is a structure of 2 integers) is supposed to be >> aligned based on the uint64_t requirements. Bad assumption! 
>> >> Looking at the code one might notice that the orte_process_name_t is stored >> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold >> on the SPARC architecture because the two types (int32_t and int64_t) have >> different alignments. However, ORTE define a type for orte_process_name_t. >> Thus, I think that if instead of saving the orte_process_name_t as an >> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away. >> >> George. >> >> >> >> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet < >> gilles.gouaillar...@iferc.org> wrote: >> >>> Kawashima-san and all, >>> >>> Here is attached a one off patch for v1.8. >>> /* it does not use the __attribute__ modifier that might not be >>> supported by all compilers */ >>> >>> as far as i am concerned, the same issue is also in the trunk, >>> and if you do not hit it, it just means you are lucky :-) >>> >>> the same issue might also be in other parts of the code :-( >>> >>> Cheers, >>> >>> Gilles >>> >>> On 2014/08/08 13:45, Kawashima, Takahiro wrote: Gilles, George, The problem is the one Gilles pointed. I temporarily modified the code bellow and the bus error disappeared. --- orte/util/nidmap.c (revision 32447) +++ orte/util/nidmap.c (working copy)
Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc
George,

(one of the) faulty line was :

if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL, OPAL_DB_LOCALLDR, (opal_identifier_t*)&proc, OPAL_ID_T))) {

so if proc is not 64 bits aligned, a SIGBUS will occur on sparc. as you pointed out, replacing OPAL_ID_T with ORTE_NAME will very likely fix the issue (I have no arch to test...)

I was initially also "confused" by the following line

if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_INTERNAL, ORTE_DB_NPROC_OFFSET, &offset, OPAL_UINT32))) {

the first argument of store is an (opal_identifier_t *) strictly speaking this is "a pointer to a 64 bits aligned address", and proc might not be 64 bits aligned. /* that being said, there is no crash :-) */

in this case, the opal_db.store pointer points to the store function (db_hash.c:178) and proc is only used in a memcpy at line 194, so 64 bits alignment is not required (and the comment is explicit: /* to protect alignment, copy the data across */).

that might sound pedantic, but are we doing the right thing here ? (e.g. cast to (opal_identifier_t *) followed by a memcpy in case the pointer was not 64 bits aligned, vs always use aligned data ?)

Cheers,

Gilles

On 2014/08/08 14:58, George Bosilca wrote:
> This is a gigantic patch for an almost trivial issue. The current problem is purely related to the fact that in a single location (nidmap.c) the orte_process_name_t (which is a structure of 2 integers) is supposed to be aligned based on the uint64_t requirements. Bad assumption!
>
> Looking at the code one might notice that the orte_process_name_t is stored using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold on the SPARC architecture because the two types (int32_t and int64_t) have different alignments. However, ORTE defines a type for orte_process_name_t. Thus, I think that if instead of saving the orte_process_name_t as an OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
>
> George.
>
> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet < gilles.gouaillar...@iferc.org> wrote:
>> Kawashima-san and all,
>>
>> Here is attached a one-off patch for v1.8. /* it does not use the __attribute__ modifier that might not be supported by all compilers */
>>
>> as far as i am concerned, the same issue is also in the trunk, and if you do not hit it, it just means you are lucky :-)
>>
>> the same issue might also be in other parts of the code :-(
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
>>> Gilles, George,
>>>
>>> The problem is the one Gilles pointed out. I temporarily modified the code below and the bus error disappeared.
>>>
>>> --- orte/util/nidmap.c (revision 32447)
>>> +++ orte/util/nidmap.c (working copy)
>>> @@ -885,7 +885,7 @@
>>> orte_proc_state_t state;
>>> orte_app_idx_t app_idx;
>>> int32_t restarts;
>>> -orte_process_name_t proc, dmn;
>>> +orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
>>> char *hostname;
>>> uint8_t flag;
>>> opal_buffer_t *bptr;
>>>
>>> Takahiro Kawashima,
>>> MPI development team,
>>> Fujitsu
>>>
>>>> Kawashima-san,
>>>>
>>>> This is interesting :-)
>>>>
>>>> proc is in the stack and has type orte_process_name_t with
>>>>
>>>> typedef uint32_t orte_jobid_t;
>>>> typedef uint32_t orte_vpid_t;
>>>> struct orte_process_name_t {
>>>> orte_jobid_t jobid; /**< Job number */
>>>> orte_vpid_t vpid; /**< Process id - equivalent to rank */
>>>> };
>>>> typedef struct orte_process_name_t orte_process_name_t;
>>>>
>>>> so there is really no reason to align this on 8 bytes... but later, proc is cast into a uint64_t ... so proc should have been aligned on 8 bytes but it is too late, and hence the glorious SIGBUS
>>>>
>>>> this is loosely related to http://www.open-mpi.org/community/lists/devel/2014/08/15532.php (see heterogeneous.v2.patch) if we make opal_process_name_t a union of uint64_t and a struct of two uint32_t, the compiler will align this on 8 bytes. note the patch is not enough (and will not apply on the v1.8 branch anyway), we could simply remove orte_process_name_t and ompi_process_name_t and use only opal_process_name_t (and never declare variables with type opal_proc_name_t otherwise alignment might be incorrect)
>>>>
>>>> as a workaround, you can declare an opal_process_name_t (for alignment), and cast it to an orte_process_name_t
>>>>
>>>> i will write a patch (i will not be able to test on sparc ...) please note this issue might be present in other places
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/08/08 13:03, Kawashima, Takahiro wrote:
>>>>> Hi,
>>>>>
>>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris 10 Sparc and I receive a bus error, if I run a small program.
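The union proposed in the quoted analysis would look roughly like this (a hypothetical declaration, not the committed fix):

    #include <stdint.h>

    typedef union {
        uint64_t opaque;        /* forces 8-byte alignment on the union */
        struct {
            uint32_t jobid;     /* job number */
            uint32_t vpid;      /* rank within the job */
        } name;
    } opal_process_name_t;

    /* sizeof(opal_process_name_t) == 8 and its alignment is that of
     * uint64_t, so casting a pointer to it to (uint64_t *) no longer
     * faults on SPARC */

Stack variables of this type would then always be safe to pass through the (opal_identifier_t *) interface, which is exactly the property the nidmap.c code assumed.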
Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc
This is a gigantic patch for an almost trivial issue. The current problem is purely related to the fact that in a single location (nidmap.c) the orte_process_name_t (which is a structure of 2 integers) is supposed to be aligned based on the uint64_t requirements. Bad assumption! Looking at the code one might notice that the orte_process_name_t is stored using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold on the SPARC architecture because the two types (int32_t and int64_t) have different alignments. However, ORTE defines a type for orte_process_name_t. Thus, I think that if instead of saving the orte_process_name_t as an OPAL_ID_T, we save it as an ORTE_NAME the issue will go away. George. On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet < gilles.gouaillar...@iferc.org> wrote: > Kawashima-san and all, > > Here is attached a one-off patch for v1.8. > /* it does not use the __attribute__ modifier that might not be > supported by all compilers */ > > as far as i am concerned, the same issue is also in the trunk, > and if you do not hit it, it just means you are lucky :-) > > the same issue might also be in other parts of the code :-( > > Cheers, > > Gilles > > On 2014/08/08 13:45, Kawashima, Takahiro wrote: > > Gilles, George, > > > > The problem is the one Gilles pointed out. > > I temporarily modified the code below and the bus error disappeared. > > > > --- orte/util/nidmap.c (revision 32447) > > +++ orte/util/nidmap.c (working copy) > > @@ -885,7 +885,7 @@ > > orte_proc_state_t state; > > orte_app_idx_t app_idx; > > int32_t restarts; > > -orte_process_name_t proc, dmn; > > +orte_process_name_t proc __attribute__((__aligned__(8))), dmn; > > char *hostname; > > uint8_t flag; > > opal_buffer_t *bptr; > > > > Takahiro Kawashima, > > MPI development team, > > Fujitsu > > > >> Kawashima-san, > >> > >> This is interesting :-) > >> > >> proc is in the stack and has type orte_process_name_t > >> > >> with > >> > >> typedef uint32_t orte_jobid_t; > >> typedef uint32_t orte_vpid_t; > >> struct orte_process_name_t { > >> orte_jobid_t jobid; /**< Job number */ > >> orte_vpid_t vpid; /**< Process id - equivalent to rank */ > >> }; > >> typedef struct orte_process_name_t orte_process_name_t; > >> > >> > >> so there is really no reason to align this on 8 bytes... > >> but later, proc is cast into a uint64_t ... > >> so proc should have been aligned on 8 bytes but it is too late, > >> and hence the glorious SIGBUS > >> > >> > >> this is loosely related to > >> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php > >> (see heterogeneous.v2.patch) > >> if we make opal_process_name_t a union of uint64_t and a struct of two > >> uint32_t, the compiler > >> will align this on 8 bytes. > >> note the patch is not enough (and will not apply on the v1.8 branch > anyway), > >> we could simply remove orte_process_name_t and ompi_process_name_t and > >> use only > >> opal_process_name_t (and never declare variables with type > >> opal_proc_name_t otherwise alignment might be incorrect) > >> > >> as a workaround, you can declare an opal_process_name_t (for alignment), > >> and cast it to an orte_process_name_t > >> > >> i will write a patch (i will not be able to test on sparc ...) > >> please note this issue might be present in other places > >> > >> Cheers, > >> > >> Gilles > >> > >> On 2014/08/08 13:03, Kawashima, Takahiro wrote: > >>> Hi, > >>> > I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris > 10 Sparc and I receive a bus error, if I run a small program.
> >>> I've finally reproduced the bus error in my SPARC environment. > >>> > >>> #0 0x00db4740 (__waitpid_nocancel + 0x44) > (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4) > >>> #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct > siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in > ../sigattach.c > >>> #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 > "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line > 252 in db_hash.c > >>> #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long > *) 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 > "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line > 49 in db_base_fns.c > >>> #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) > 0x00281d70) at line 975 in nidmap.c > >>> #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct > opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c > >>> #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in > ess_env_module.c > >>> #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) > 0x,pargv=(char ***) 0x,flags=32) at line > 148 in orte_init.c > >>> #8 0x001a6f08 (ompi_mpi_ini
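For readers following the proposed fix: a minimal sketch of the change George describes, assuming the v1.8 DSS interface (opal_dss.unpack with a registered data type); the actual call sites in nidmap.c may differ, and the error handling is illustrative.

    /* Sketch only: unpack the name with its registered DSS type instead
     * of reinterpreting the two 32-bit fields as one uint64_t. */
    orte_process_name_t proc;
    int32_t n = 1;
    int rc;

    /* before (alignment-sensitive): the struct is filled through a
     * uint64_t, so the destination must be 8-byte aligned */
    /* rc = opal_dss.unpack(buffer, &proc, &n, OPAL_ID_T); */

    /* after: ORTE_NAME handles jobid and vpid as 32-bit values, so the
     * stack variable only needs 4-byte alignment */
    if (OPAL_SUCCESS != (rc = opal_dss.unpack(buffer, &proc, &n, ORTE_NAME))) {
        ORTE_ERROR_LOG(rc);
    }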
Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc
Gilles, I applied your patch to v1.8 and it ran successfully on my SPARC machines. Takahiro Kawashima, MPI development team, Fujitsu > Kawashima-san and all, > > Here is attached a one-off patch for v1.8. > /* it does not use the __attribute__ modifier that might not be > supported by all compilers */ > > as far as I am concerned, the same issue is also in the trunk, > and if you do not hit it, it just means you are lucky :-) > > the same issue might also be in other parts of the code :-( > > Cheers, > > Gilles > > On 2014/08/08 13:45, Kawashima, Takahiro wrote: > > Gilles, George, > > > > The problem is the one Gilles pointed out. > > I temporarily modified the code below and the bus error disappeared. > > > > --- orte/util/nidmap.c (revision 32447) > > +++ orte/util/nidmap.c (working copy) > > @@ -885,7 +885,7 @@ > > orte_proc_state_t state; > > orte_app_idx_t app_idx; > > int32_t restarts; > > -orte_process_name_t proc, dmn; > > +orte_process_name_t proc __attribute__((__aligned__(8))), dmn; > > char *hostname; > > uint8_t flag; > > opal_buffer_t *bptr; > > > > Takahiro Kawashima, > > MPI development team, > > Fujitsu > > > >> Kawashima-san, > >> > >> This is interesting :-) > >> > >> proc is on the stack and has type orte_process_name_t > >> > >> with > >> > >> typedef uint32_t orte_jobid_t; > >> typedef uint32_t orte_vpid_t; > >> struct orte_process_name_t { > >> orte_jobid_t jobid; /**< Job number */ > >> orte_vpid_t vpid; /**< Process id - equivalent to rank */ > >> }; > >> typedef struct orte_process_name_t orte_process_name_t; > >> > >> > >> so there is really no reason to align this on 8 bytes... > >> but later, proc is cast into a uint64_t ... > >> so proc should have been aligned on 8 bytes but it is too late, > >> and hence the glorious SIGBUS > >> > >> > >> this is loosely related to > >> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php > >> (see heterogeneous.v2.patch) > >> if we make opal_process_name_t a union of uint64_t and a struct of two > >> uint32_t, the compiler > >> will align this on 8 bytes. > >> note the patch is not enough (and will not apply on the v1.8 branch > >> anyway), > >> we could simply remove orte_process_name_t and ompi_process_name_t and > >> use only > >> opal_process_name_t (and never declare variables with type > >> opal_proc_name_t otherwise alignment might be incorrect) > >> > >> as a workaround, you can declare an opal_process_name_t (for alignment), > >> and cast it to an orte_process_name_t > >> > >> I will write a patch (I will not be able to test on sparc ...) > >> please note this issue might be present in other places > >> > >> Cheers, > >> > >> Gilles > >> > >> On 2014/08/08 13:03, Kawashima, Takahiro wrote: > >>> Hi, > >>> > I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris > 10 Sparc and I receive a bus error if I run a small program. > >>> I've finally reproduced the bus error in my SPARC environment.
> >>> > >>> #0 0x00db4740 (__waitpid_nocancel + 0x44) > >>> (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4) > >>> #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct > >>> siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 > >>> in ../sigattach.c > >>> #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) > >>> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 > >>> "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line > >>> 252 in db_hash.c > >>> #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) > >>> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 > >>> "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at > >>> line 49 in db_base_fns.c > >>> #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) > >>> 0x00281d70) at line 975 in nidmap.c > >>> #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct > >>> opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c > >>> #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in > >>> ess_env_module.c > >>> #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) > >>> 0x,pargv=(char ***) 0x,flags=32) at line > >>> 148 in orte_init.c > >>> #8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) > >>> 0x07fef348,requested=0,provided=(int *) 0x07fee698) at > >>> line 464 in ompi_mpi_init.c > >>> #9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) > >>> 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in > >>> init.c > >>> #10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) > >>> 0x07fef348) at line 8 in mpiinitfinalize.c > >>> #11 0x00d2b81c (__libc_start_main + 0x194) > >>> (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0) > >>> #12 0x0010094c (_start + 0x2c) () > >>> > >>> The line 2
Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc
Kawashima-san and all, Here is attached a one-off patch for v1.8. /* it does not use the __attribute__ modifier that might not be supported by all compilers */ as far as I am concerned, the same issue is also in the trunk, and if you do not hit it, it just means you are lucky :-) the same issue might also be in other parts of the code :-( Cheers, Gilles On 2014/08/08 13:45, Kawashima, Takahiro wrote: > Gilles, George, > > The problem is the one Gilles pointed out. > I temporarily modified the code below and the bus error disappeared. > > --- orte/util/nidmap.c (revision 32447) > +++ orte/util/nidmap.c (working copy) > @@ -885,7 +885,7 @@ > orte_proc_state_t state; > orte_app_idx_t app_idx; > int32_t restarts; > -orte_process_name_t proc, dmn; > +orte_process_name_t proc __attribute__((__aligned__(8))), dmn; > char *hostname; > uint8_t flag; > opal_buffer_t *bptr; > > Takahiro Kawashima, > MPI development team, > Fujitsu > >> Kawashima-san, >> >> This is interesting :-) >> >> proc is on the stack and has type orte_process_name_t >> >> with >> >> typedef uint32_t orte_jobid_t; >> typedef uint32_t orte_vpid_t; >> struct orte_process_name_t { >> orte_jobid_t jobid; /**< Job number */ >> orte_vpid_t vpid; /**< Process id - equivalent to rank */ >> }; >> typedef struct orte_process_name_t orte_process_name_t; >> >> >> so there is really no reason to align this on 8 bytes... >> but later, proc is cast into a uint64_t ... >> so proc should have been aligned on 8 bytes but it is too late, >> and hence the glorious SIGBUS >> >> >> this is loosely related to >> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php >> (see heterogeneous.v2.patch) >> if we make opal_process_name_t a union of uint64_t and a struct of two >> uint32_t, the compiler >> will align this on 8 bytes. >> note the patch is not enough (and will not apply on the v1.8 branch anyway), >> we could simply remove orte_process_name_t and ompi_process_name_t and >> use only >> opal_process_name_t (and never declare variables with type >> opal_proc_name_t otherwise alignment might be incorrect) >> >> as a workaround, you can declare an opal_process_name_t (for alignment), >> and cast it to an orte_process_name_t >> >> I will write a patch (I will not be able to test on sparc ...) >> please note this issue might be present in other places >> >> Cheers, >> >> Gilles >> >> On 2014/08/08 13:03, Kawashima, Takahiro wrote: >>> Hi, >>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris 10 Sparc and I receive a bus error if I run a small program. >>> I've finally reproduced the bus error in my SPARC environment.
>>> >>> #0 0x00db4740 (__waitpid_nocancel + 0x44) >>> (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4) >>> #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct >>> siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in >>> ../sigattach.c >>> #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) >>> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 >>> "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line >>> 252 in db_hash.c >>> #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) >>> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 >>> "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line >>> 49 in db_base_fns.c >>> #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) >>> 0x00281d70) at line 975 in nidmap.c >>> #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct >>> opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c >>> #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c >>> #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) >>> 0x,pargv=(char ***) 0x,flags=32) at line >>> 148 in orte_init.c >>> #8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) >>> 0x07fef348,requested=0,provided=(int *) 0x07fee698) at line >>> 464 in ompi_mpi_init.c >>> #9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) >>> 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c >>> #10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) >>> 0x07fef348) at line 8 in mpiinitfinalize.c >>> #11 0x00d2b81c (__libc_start_main + 0x194) >>> (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0) >>> #12 0x0010094c (_start + 0x2c) () >>> >>> The line 252 in opal/mca/db/hash/db_hash.c is: >>> >>> case OPAL_UINT64: >>> if (NULL == data) { >>> OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM); >>> return OPAL_ERR_BAD_PARAM; >>> } >>> kv->type = OPAL_UINT64; >>> kv->data.uint64 = *(uint64_t*)(data); // !!! here !!! >>> break; >>> >>> My environment is: >>> >>> Open MPI v1.8 branch r32447 (lates
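The temporary fix in the quoted diff relies on a GCC extension, which is why the one-off patch avoids it. A standalone illustration of what the attribute does (hypothetical stand-in type; compile with gcc):

    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint32_t jobid; uint32_t vpid; } name_t; /* stand-in */

    int main(void)
    {
        /* same idea as the diff: force 8-byte alignment on the stack
         * variable so a later (uint64_t *) cast is safe on SPARC */
        name_t proc __attribute__((__aligned__(8)));
        proc.jobid = 0;
        proc.vpid = 0;
        printf("&proc mod 8 = %u\n", (unsigned)((uintptr_t)&proc % 8));
        return 0;
    }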
Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc
Gilles, George, The problem is the one Gilles pointed out. I temporarily modified the code below and the bus error disappeared. --- orte/util/nidmap.c (revision 32447) +++ orte/util/nidmap.c (working copy) @@ -885,7 +885,7 @@ orte_proc_state_t state; orte_app_idx_t app_idx; int32_t restarts; -orte_process_name_t proc, dmn; +orte_process_name_t proc __attribute__((__aligned__(8))), dmn; char *hostname; uint8_t flag; opal_buffer_t *bptr; Takahiro Kawashima, MPI development team, Fujitsu > Kawashima-san, > > This is interesting :-) > > proc is on the stack and has type orte_process_name_t > > with > > typedef uint32_t orte_jobid_t; > typedef uint32_t orte_vpid_t; > struct orte_process_name_t { > orte_jobid_t jobid; /**< Job number */ > orte_vpid_t vpid; /**< Process id - equivalent to rank */ > }; > typedef struct orte_process_name_t orte_process_name_t; > > > so there is really no reason to align this on 8 bytes... > but later, proc is cast into a uint64_t ... > so proc should have been aligned on 8 bytes but it is too late, > and hence the glorious SIGBUS > > > this is loosely related to > http://www.open-mpi.org/community/lists/devel/2014/08/15532.php > (see heterogeneous.v2.patch) > if we make opal_process_name_t a union of uint64_t and a struct of two > uint32_t, the compiler > will align this on 8 bytes. > note the patch is not enough (and will not apply on the v1.8 branch anyway), > we could simply remove orte_process_name_t and ompi_process_name_t and > use only > opal_process_name_t (and never declare variables with type > opal_proc_name_t otherwise alignment might be incorrect) > > as a workaround, you can declare an opal_process_name_t (for alignment), > and cast it to an orte_process_name_t > > I will write a patch (I will not be able to test on sparc ...) > please note this issue might be present in other places > > Cheers, > > Gilles > > On 2014/08/08 13:03, Kawashima, Takahiro wrote: > > Hi, > > > >> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris > >> 10 Sparc and I receive a bus error if I run a small program. > > I've finally reproduced the bus error in my SPARC environment.
> > > > #0 0x00db4740 (__waitpid_nocancel + 0x44) > > (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4) > > #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct > > siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in > > ../sigattach.c > > #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) > > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 > > "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line > > 252 in db_hash.c > > #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) > > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 > > "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line > > 49 in db_base_fns.c > > #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) > > 0x00281d70) at line 975 in nidmap.c > > #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct > > opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c > > #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c > > #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) > > 0x,pargv=(char ***) 0x,flags=32) at line > > 148 in orte_init.c > > #8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) > > 0x07fef348,requested=0,provided=(int *) 0x07fee698) at line > > 464 in ompi_mpi_init.c > > #9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) > > 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c > > #10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) > > 0x07fef348) at line 8 in mpiinitfinalize.c > > #11 0x00d2b81c (__libc_start_main + 0x194) > > (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0) > > #12 0x0010094c (_start + 0x2c) () > > > > The line 252 in opal/mca/db/hash/db_hash.c is: > > > > case OPAL_UINT64: > > if (NULL == data) { > > OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM); > > return OPAL_ERR_BAD_PARAM; > > } > > kv->type = OPAL_UINT64; > > kv->data.uint64 = *(uint64_t*)(data); // !!! here !!! > > break; > > > > My environment is: > > > > Open MPI v1.8 branch r32447 (latest) > > configure --enable-debug > > SPARC-V9 (Fujitsu SPARC64 IXfx) > > Linux (custom) > > gcc 4.2.4 > > > > I could not reproduce it with Open MPI trunk nor with Fujitsu compiler. > > > > Can this information help? > > > > Takahiro Kawashima, > > MPI development team, > > Fujitsu > > > >> Hi, > >> > >> I'm sorry once more to answer late, but the last two days our mail > >> server was down (hardware error). > >> > >>> Did you configure this --enable-debug? > >> Yes, I
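The workaround Gilles describes, declaring aligned storage and casting, might look like this sketch (type names are stand-ins; formally the aliasing below is compiler-tolerated rather than strictly conforming, which is what the union approach later in the thread cleans up):

    #include <stdint.h>

    typedef struct { uint32_t jobid; uint32_t vpid; } orte_name_t; /* stand-in */

    void example(void)
    {
        /* a uint64_t is naturally 8-byte aligned, so a pointer to it can
         * safely be read back as a uint64_t after the fields are set */
        uint64_t storage = 0;
        orte_name_t *proc = (orte_name_t *)&storage;
        proc->jobid = 1;
        proc->vpid  = 2;
    }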
Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value
Paul's tests identified a small issue with the previous patch (a real corner case for ARM v5). The patch below fixes all known issues. Btw, there is still room for volunteers for the .asm work. George. On Tue, Aug 5, 2014 at 2:23 PM, George Bosilca wrote: > Thanks to Paul's help all the inlined atomics have been tested. The new > patch is attached below. However, this only fixes the inline atomics, all > those generated from the *.asm files have not been updated. Any volunteer? > > George. > > > > On Aug 1, 2014, at 18:09 , Paul Hargrove wrote: > > I have confirmed that George's latest version works on both SPARC ABIs. > > ARMv7 and three MIPS ABIs still pending... > > -Paul > > > On Fri, Aug 1, 2014 at 9:40 AM, George Bosilca > wrote: > >> Another version of the atomic patch. Paul has tested it on a bunch of >> platforms. At this point we have confirmation from all architectures except >> SPARC (v8+ and v9). >> >> George. >> >> >> >> On Jul 31, 2014, at 19:13 , George Bosilca wrote: >> >> > All, >> > >> > Here is the patch that changes the meaning of the atomics to make them >> always return the previous value (similar to sync_fetch_and_<*>). I tested >> this with the following atomics: OS X, gcc style intrinsics and AMD64. >> > >> > I did not change the base assembly files used when GCC style assembly >> operations are not supported. If someone feels like fixing them, feel free. >> > >> > Paul, I know you have a pretty diverse range of computers. Can you try to >> compile and run a “make check” with the following patch? >> > >> > George. >> > >> > >> > >> > On Jul 30, 2014, at 15:21 , Nathan Hjelm wrote: >> > >> >> >> >> That is what I would prefer. I was trying to not disturb things too >> >> much :). Please bring the changes over! >> >> >> >> -Nathan >> >> >> >> On Wed, Jul 30, 2014 at 03:18:44PM -0400, George Bosilca wrote: >> >>> Why do you want to add new versions? This will lead to having two, >> almost >> >>> identical, sets of atomics that are conceptually equivalent but >> different >> >>> in terms of code. And we will have to maintain both! >> >>> I did a similar change in a fork of OPAL in another project but >> instead of >> >>> adding another flavor of atomics, I completely replaced the >> available ones >> >>> with a set returning the old value. I can bring the code over. >> >>>George. >> >>> >> >>> On Tue, Jul 29, 2014 at 5:29 PM, Paul Hargrove >> wrote: >> >>> >> >>>On Tue, Jul 29, 2014 at 2:10 PM, Nathan Hjelm >> wrote: >> >>> >> >>> Is there a reason why the >> >>> current implementations of opal atomics (add, cmpset) do not >> return >> >>> the >> >>> old value? >> >>> >> >>>Because some CPUs don't implement such an atomic instruction? >> >>> >> >>>On any CPU one *can* certainly synthesize the desired operation >> with an >> >>>added read before the compare-and-swap to return a value that was >> >>>present at some time before a failed cmpset. That is almost >> certainly >> >>>sufficient for your purposes. However, the added load makes it >> >>>(marginally) more expensive on some CPUs that only have the native >> >>>equivalent of gcc's __sync_bool_compare_and_swap(). >> > atomics.patch Description: Binary data
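For readers following along, a sketch of the semantics being adopted (illustrative names, not the actual OPAL API), using the GCC intrinsics the thread mentions:

    #include <stdint.h>

    /* CAS that returns the previous value, like __sync_fetch_and_* */
    static inline int32_t cas_val(volatile int32_t *addr,
                                  int32_t oldval, int32_t newval)
    {
        return __sync_val_compare_and_swap(addr, oldval, newval);
    }

    /* Synthesized from a boolean-only CAS, as Paul describes: the extra
     * load supplies a value that was present at some time before a
     * failed cmpset.  Caveat: a failed CAS can still return oldval if
     * the location changed and changed back between the load and the
     * CAS, so callers must not infer success from the return value. */
    static inline int32_t cas_val_from_bool(volatile int32_t *addr,
                                            int32_t oldval, int32_t newval)
    {
        int32_t prev = *addr;                       /* extra load */
        if (__sync_bool_compare_and_swap(addr, oldval, newval)) {
            return oldval;                          /* value just before success */
        }
        return prev;                                /* some earlier value */
    }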
Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc
Hi George, > Takahiro you can confirm this by printing the value of data when signal is > raised. It's in the trace: 0x07fede74 #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line 252 in db_hash.c I want to dig into this issue, but unfortunately I have no time today. My SPARC machines go down for maintenance in an hour... Takahiro Kawashima, MPI development team, Fujitsu > I have an extremely vague recollection about a similar issue in the > datatype engine: on the SPARC architecture 64-bit integers must be > aligned on a 64-bit boundary or you get a bus error. > > Takahiro you can confirm this by printing the value of data when the signal is > raised. > > George. > > > > On Fri, Aug 8, 2014 at 12:03 AM, Kawashima, Takahiro < > t-kawash...@jp.fujitsu.com> wrote: > > Hi, > > > > > > >>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris > > > > >>> 10 Sparc and I receive a bus error if I run a small program. > > > > I've finally reproduced the bus error in my SPARC environment. > > > > #0 0x00db4740 (__waitpid_nocancel + 0x44) > > (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4) > > #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct > > siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in > > ../sigattach.c > > #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) > > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 > > "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line > > 252 in db_hash.c > > #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) > > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 > > "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line > > 49 in db_base_fns.c > > #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) > > 0x00281d70) at line 975 in nidmap.c > > #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct > > opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c > > #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c > > #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) > > 0x,pargv=(char ***) 0x,flags=32) at line > > 148 in orte_init.c > > #8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) > > 0x07fef348,requested=0,provided=(int *) 0x07fee698) at line > > 464 in ompi_mpi_init.c > > #9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) > > 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c > > #10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) > > 0x07fef348) at line 8 in mpiinitfinalize.c > > #11 0x00d2b81c (__libc_start_main + 0x194) > > (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0) > > #12 0x0010094c (_start + 0x2c) () > > > > The line 252 in opal/mca/db/hash/db_hash.c is: > > > > case OPAL_UINT64: > > if (NULL == data) { > > OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM); > > return OPAL_ERR_BAD_PARAM; > > } > > kv->type = OPAL_UINT64; > > kv->data.uint64 = *(uint64_t*)(data); // !!! here !!! > > break; > > > > My environment is: > > > > Open MPI v1.8 branch r32447 (latest) > > configure --enable-debug > > SPARC-V9 (Fujitsu SPARC64 IXfx) > > Linux (custom) > > gcc 4.2.4 > > > > I could not reproduce it with the Open MPI trunk or with the Fujitsu compiler. > > > > Can this information help? > > > > Takahiro Kawashima, > > MPI development team, > > Fujitsu > > > > > Hi, > > > > > > I'm sorry once more to answer late, but the last two days our mail > > > server was down (hardware error). > > > > > > > Did you configure this --enable-debug?
> > > > > > Yes, I used the following command. > > > > > > ../openmpi-1.8.2rc3/configure --prefix=/usr/local/openmpi-1.8.2_64_gcc \ > > > --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \ > > > --with-jdk-bindir=/usr/local/jdk1.8.0/bin \ > > > --with-jdk-headers=/usr/local/jdk1.8.0/include \ > > > JAVA_HOME=/usr/local/jdk1.8.0 \ > > > LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \ > > > CC="gcc" CXX="g++" FC="gfortran" \ > > > CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \ > > > CPP="cpp" CXXCPP="cpp" \ > > > CPPFLAGS="" CXXCPPFLAGS="" \ > > > --enable-mpi-cxx \ > > > --enable-cxx-exceptions \ > > > --enable-mpi-java \ > > > --enable-heterogeneous \ > > > --enable-mpi-thread-multiple \ > > > --with-threads=posix \ > > > --with-hwloc=internal \ > > > --without-verbs \ > > > --with-wrapper-cflags="-std=c11 -m64" \ > > > --enable-debug \ > > > |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc > > > > > > > > > > > > > If so, you should get a line number in the backtrace > > > > > > I got them for gdb (see below)
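The faulting statement in db_hash.c, kv->data.uint64 = *(uint64_t*)(data), dereferences a pointer that is only 4-byte aligned (0x...e74). One way it could be made alignment-safe is a byte-wise copy; this is a hypothetical alternative, not the fix applied in the thread, which aligns the caller's variable instead:

    #include <stdint.h>
    #include <string.h>

    /* read a 64-bit value without assuming `data` is 8-byte aligned */
    static uint64_t read_u64_unaligned(const void *data)
    {
        uint64_t v;
        memcpy(&v, data, sizeof(v)); /* byte copy is legal at any alignment */
        return v;
    }
    /* at the crash site this would be:
     * kv->data.uint64 = read_u64_unaligned(data); */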
Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc
Kawashima-san, This is interesting :-) proc is on the stack and has type orte_process_name_t with typedef uint32_t orte_jobid_t; typedef uint32_t orte_vpid_t; struct orte_process_name_t { orte_jobid_t jobid; /**< Job number */ orte_vpid_t vpid; /**< Process id - equivalent to rank */ }; typedef struct orte_process_name_t orte_process_name_t; so there is really no reason to align this on 8 bytes... but later, proc is cast into a uint64_t ... so proc should have been aligned on 8 bytes but it is too late, and hence the glorious SIGBUS this is loosely related to http://www.open-mpi.org/community/lists/devel/2014/08/15532.php (see heterogeneous.v2.patch) if we make opal_process_name_t a union of uint64_t and a struct of two uint32_t, the compiler will align this on 8 bytes. note the patch is not enough (and will not apply on the v1.8 branch anyway), we could simply remove orte_process_name_t and ompi_process_name_t and use only opal_process_name_t (and never declare variables with type opal_proc_name_t otherwise alignment might be incorrect) as a workaround, you can declare an opal_process_name_t (for alignment), and cast it to an orte_process_name_t I will write a patch (I will not be able to test on sparc ...) please note this issue might be present in other places Cheers, Gilles On 2014/08/08 13:03, Kawashima, Takahiro wrote: > Hi, > >> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris >> 10 Sparc and I receive a bus error if I run a small program. > I've finally reproduced the bus error in my SPARC environment. > > #0 0x00db4740 (__waitpid_nocancel + 0x44) > (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4) > #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct siginfo > *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in > ../sigattach.c > #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 > "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line 252 > in db_hash.c > #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 > "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line > 49 in db_base_fns.c > #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) > 0x00281d70) at line 975 in nidmap.c > #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct > opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c > #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c > #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) > 0x,pargv=(char ***) 0x,flags=32) at line 148 > in orte_init.c > #8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) > 0x07fef348,requested=0,provided=(int *) 0x07fee698) at line > 464 in ompi_mpi_init.c > #9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) > 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c > #10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) > 0x07fef348) at line 8 in mpiinitfinalize.c > #11 0x00d2b81c (__libc_start_main + 0x194) > (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0) > #12 0x0010094c (_start + 0x2c) () > > The line 252 in opal/mca/db/hash/db_hash.c is: > > case OPAL_UINT64: > if (NULL == data) { > OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM); > return OPAL_ERR_BAD_PARAM; > } > kv->type = OPAL_UINT64; > kv->data.uint64 = *(uint64_t*)(data); // !!! here !!!
> break; > > My environment is: > > Open MPI v1.8 branch r32447 (latest) > configure --enable-debug > SPARC-V9 (Fujitsu SPARC64 IXfx) > Linux (custom) > gcc 4.2.4 > > I could not reproduce it with Open MPI trunk nor with Fujitsu compiler. > > Can this information help? > > Takahiro Kawashima, > MPI development team, > Fujitsu > >> Hi, >> >> I'm sorry once more to answer late, but the last two days our mail >> server was down (hardware error). >> >>> Did you configure this --enable-debug? >> Yes, I used the following command. >> >> ../openmpi-1.8.2rc3/configure --prefix=/usr/local/openmpi-1.8.2_64_gcc \ >> --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \ >> --with-jdk-bindir=/usr/local/jdk1.8.0/bin \ >> --with-jdk-headers=/usr/local/jdk1.8.0/include \ >> JAVA_HOME=/usr/local/jdk1.8.0 \ >> LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \ >> CC="gcc" CXX="g++" FC="gfortran" \ >> CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \ >> CPP="cpp" CXXCPP="cpp" \ >> CPPFLAGS="" CXXCPPFLAGS="" \ >> --enable-mpi-cxx \ >> --enable-cxx-exceptions \ >> --enable-mpi-java \ >> --enable-heterogeneous \ >> --enable-mpi-thread-multiple \ >> --with-threads=posix \ >> --with-hwloc=internal \ >> --without-verbs \ >> --with-wrapper-cflags="-
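A sketch of the union Gilles proposes (field names follow the thread; the real patch may differ): the uint64_t member forces 8-byte alignment on every instance, so the cast to uint64_t at the store site is always safe.

    #include <stdint.h>

    typedef union {
        uint64_t u64;              /* forces 8-byte alignment */
        struct {
            uint32_t jobid;
            uint32_t vpid;
        } name;
    } opal_process_name_sketch_t;

    /* any instance, including one on the stack, can now go through the
     * OPAL_UINT64 path in db_hash.c without a SIGBUS */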
Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc
I have an extremely vague recollection about a similar issue in the datatype engine: on the SPARC architecture 64-bit integers must be aligned on a 64-bit boundary or you get a bus error. Takahiro, you can confirm this by printing the value of data when the signal is raised. George. On Fri, Aug 8, 2014 at 12:03 AM, Kawashima, Takahiro < t-kawash...@jp.fujitsu.com> wrote: > Hi, > > > > >>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris > > > >>> 10 Sparc and I receive a bus error if I run a small program. > > I've finally reproduced the bus error in my SPARC environment. > > #0 0x00db4740 (__waitpid_nocancel + 0x44) > (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4) > #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct > siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in > ../sigattach.c > #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 > "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line > 252 in db_hash.c > #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) > 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 > "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line > 49 in db_base_fns.c > #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) > 0x00281d70) at line 975 in nidmap.c > #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct > opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c > #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c > #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) > 0x,pargv=(char ***) 0x,flags=32) at line > 148 in orte_init.c > #8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) > 0x07fef348,requested=0,provided=(int *) 0x07fee698) at line > 464 in ompi_mpi_init.c > #9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) > 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c > #10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) > 0x07fef348) at line 8 in mpiinitfinalize.c > #11 0x00d2b81c (__libc_start_main + 0x194) > (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0) > #12 0x0010094c (_start + 0x2c) () > > The line 252 in opal/mca/db/hash/db_hash.c is: > > case OPAL_UINT64: > if (NULL == data) { > OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM); > return OPAL_ERR_BAD_PARAM; > } > kv->type = OPAL_UINT64; > kv->data.uint64 = *(uint64_t*)(data); // !!! here !!! > break; > > My environment is: > > Open MPI v1.8 branch r32447 (latest) > configure --enable-debug > SPARC-V9 (Fujitsu SPARC64 IXfx) > Linux (custom) > gcc 4.2.4 > > I could not reproduce it with the Open MPI trunk or with the Fujitsu compiler. > > Can this information help? > > Takahiro Kawashima, > MPI development team, > Fujitsu > > > Hi, > > > > I'm sorry once more to answer late, but the last two days our mail > > server was down (hardware error). > > > > > Did you configure this --enable-debug? > > > > Yes, I used the following command.
> > > > ../openmpi-1.8.2rc3/configure --prefix=/usr/local/openmpi-1.8.2_64_gcc \ > > --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \ > > --with-jdk-bindir=/usr/local/jdk1.8.0/bin \ > > --with-jdk-headers=/usr/local/jdk1.8.0/include \ > > JAVA_HOME=/usr/local/jdk1.8.0 \ > > LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \ > > CC="gcc" CXX="g++" FC="gfortran" \ > > CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \ > > CPP="cpp" CXXCPP="cpp" \ > > CPPFLAGS="" CXXCPPFLAGS="" \ > > --enable-mpi-cxx \ > > --enable-cxx-exceptions \ > > --enable-mpi-java \ > > --enable-heterogeneous \ > > --enable-mpi-thread-multiple \ > > --with-threads=posix \ > > --with-hwloc=internal \ > > --without-verbs \ > > --with-wrapper-cflags="-std=c11 -m64" \ > > --enable-debug \ > > |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc > > > > > > > > > If so, you should get a line number in the backtrace > > > > I got them for gdb (see below), but not for "dbx". > > > > > > Kind regards > > > > Siegmar > > > > > > > > > > > > > > > On Aug 5, 2014, at 2:59 AM, Siegmar Gross > > wrote: > > > > > > > Hi, > > > > > > > > I'm sorry to answer so late, but last week I didn't have Internet > > > > access. In the meantime I've installed openmpi-1.8.2rc3 and I get > > > > the same error. > > > > > > > >> This looks like the typical type of alignment error that we used > > > >> to see when testing regularly on SPARC. :-\ > > > >> > > > >> It looks like the error was happening in mca_db_hash.so. Could > > > >> you get a stack trace / file+line number where it was failing > > > >> in mca_db_hash? (i.e., the actual bad code will likely be under > > > >> opal/mca/db/hash somewhere) > > > > > > > > Unfortunat
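The check George asks for amounts to printing the pointer and its remainder modulo 8; a minimal sketch (hypothetical helper, not part of the OMPI tree):

    #include <stdint.h>
    #include <stdio.h>

    static void check_alignment(void *data)
    {
        /* SPARC requires 8-byte alignment for 64-bit loads; a nonzero
         * remainder here explains the SIGBUS */
        printf("data=%p (mod 8 = %u)\n", data,
               (unsigned)((uintptr_t)data % 8));
    }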
Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc
Hi, > > >>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris > > >>> 10 Sparc and I receive a bus error if I run a small program. I've finally reproduced the bus error in my SPARC environment. #0 0x00db4740 (__waitpid_nocancel + 0x44) (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4) #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in ../sigattach.c #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line 252 in db_hash.c #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line 49 in db_base_fns.c #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) 0x00281d70) at line 975 in nidmap.c #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) 0x,pargv=(char ***) 0x,flags=32) at line 148 in orte_init.c #8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) 0x07fef348,requested=0,provided=(int *) 0x07fee698) at line 464 in ompi_mpi_init.c #9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c #10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) 0x07fef348) at line 8 in mpiinitfinalize.c #11 0x00d2b81c (__libc_start_main + 0x194) (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0) #12 0x0010094c (_start + 0x2c) () The line 252 in opal/mca/db/hash/db_hash.c is: case OPAL_UINT64: if (NULL == data) { OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM); return OPAL_ERR_BAD_PARAM; } kv->type = OPAL_UINT64; kv->data.uint64 = *(uint64_t*)(data); // !!! here !!! break; My environment is: Open MPI v1.8 branch r32447 (latest) configure --enable-debug SPARC-V9 (Fujitsu SPARC64 IXfx) Linux (custom) gcc 4.2.4 I could not reproduce it with the Open MPI trunk or with the Fujitsu compiler. Can this information help? Takahiro Kawashima, MPI development team, Fujitsu > Hi, > > I'm sorry once more to answer late, but the last two days our mail > server was down (hardware error). > > > Did you configure this --enable-debug? > > Yes, I used the following command. > > ../openmpi-1.8.2rc3/configure --prefix=/usr/local/openmpi-1.8.2_64_gcc \ > --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \ > --with-jdk-bindir=/usr/local/jdk1.8.0/bin \ > --with-jdk-headers=/usr/local/jdk1.8.0/include \ > JAVA_HOME=/usr/local/jdk1.8.0 \ > LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \ > CC="gcc" CXX="g++" FC="gfortran" \ > CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \ > CPP="cpp" CXXCPP="cpp" \ > CPPFLAGS="" CXXCPPFLAGS="" \ > --enable-mpi-cxx \ > --enable-cxx-exceptions \ > --enable-mpi-java \ > --enable-heterogeneous \ > --enable-mpi-thread-multiple \ > --with-threads=posix \ > --with-hwloc=internal \ > --without-verbs \ > --with-wrapper-cflags="-std=c11 -m64" \ > --enable-debug \ > |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc > > > > > If so, you should get a line number in the backtrace > > I got them for gdb (see below), but not for "dbx". > > > Kind regards > > Siegmar > > > > > > > > > On Aug 5, 2014, at 2:59 AM, Siegmar Gross > wrote: > > > > > Hi, > > > > > > I'm sorry to answer so late, but last week I didn't have Internet > > > access.
In the meantime I've installed openmpi-1.8.2rc3 and I get > > > the same error. > > > > > >> This looks like the typical type of alignment error that we used > > >> to see when testing regularly on SPARC. :-\ > > >> > > >> It looks like the error was happening in mca_db_hash.so. Could > > >> you get a stack trace / file+line number where it was failing > > >> in mca_db_hash? (i.e., the actual bad code will likely be under > > >> opal/mca/db/hash somewhere) > > > > > > Unfortunately I don't get a file+line number from a file in > > > opal/mca/db/Hash. > > > > > > > > > > > > tyr small_prog 102 ompi_info | grep MPI: > > >Open MPI: 1.8.2rc3 > > > tyr small_prog 103 which mpicc > > > /usr/local/openmpi-1.8.2_64_gcc/bin/mpicc > > > tyr small_prog 104 mpicc init_finalize.c > > > tyr small_prog 106 /opt/solstudio12.3/bin/sparcv9/dbx > /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec > > > For information about new features see `help changes' > > > To remove this message, put `dbxenv suppress_startup_message 7.9' in your > .dbxrc > > > Reading mp