Re: [OMPI devel] v1.8.2 still held up...

2014-08-08 Thread Paul Hargrove
On Thu, Aug 7, 2014 at 10:55 AM, Ralph Castain wrote: > * fixes to coll/ml that expanded to fixing page alignment in general - > someone needs to review/approve it: > https://svn.open-mpi.org/trac/ompi/ticket/4826 > I've been able to confirm that the nightly tarball (1.8.2rc4r32480) work

Re: [OMPI devel] v1.8.2 still held up...

2014-08-08 Thread Paul Hargrove
On Thu, Aug 7, 2014 at 10:55 AM, Ralph Castain wrote: > * static linking failure - Gilles has posted a proposed fix, but somebody > needs to approve and CMR it. Please see: > https://svn.open-mpi.org/trac/ompi/ticket/4834 > Jeff moved the fix to v1.8 in r32471. I have tested tonight's t

Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value

2014-08-08 Thread Hjelm, Nathan Thomas
I will try to take a look this week and see what I can do. -Nathan From: devel [devel-boun...@open-mpi.org] on behalf of George Bosilca [bosi...@icl.utk.edu] Sent: Thursday, August 07, 2014 10:37 PM To: Open MPI Developers Subject: Re: [OMPI devel] RFC: ad

Re: [OMPI devel] ORTE headers in OPAL source

2014-08-08 Thread George Bosilca
These are harmless. They are only used when FT is enabled which should rarely be the case. George. On Fri, Aug 8, 2014 at 4:36 PM, Jeff Squyres (jsquyres) wrote: > Here's a few ORTE headers in OPAL source -- can respective owners clean > these up? Thanks. > > - > mca/btl/smcuda/btl_smc

[OMPI devel] ORTE headers in OPAL source

2014-08-08 Thread Jeff Squyres (jsquyres)
Here's a few ORTE headers in OPAL source -- can respective owners clean these up? Thanks. - mca/btl/smcuda/btl_smcuda.c 63:#include "orte/mca/sstore/sstore.h" mca/btl/sm/btl_sm.c 62:#include "orte/mca/sstore/sstore.h" mca/mpool/sm/mpool_sm_module.c 34:#include "orte/mca/sstore/sstore.h" --

[OMPI devel] ompi headers in OPAL source

2014-08-08 Thread Jeff Squyres (jsquyres)
I found a few more OMPI header files included in OPAL source code. Can the respective owners clean this stuff up? Thanks! - mca/btl/openib/btl_openib_component.c 87:#include "ompi/mca/rte/rte.h" mca/btl/ugni/btl_ugni_component.c 20:#include "ompi/runtime/params.h" mca/btl/ugni/btl_ugni_ad

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Ralph Castain
Yes, I know - but the problem comes from nidmap pushing data down into the opal_db/dstore level, which then creates a copy of the data. That's where the alignment error is generated On Aug 8, 2014, at 11:17 AM, George Bosilca wrote: > On Fri, Aug 8, 2014 at 5:21 AM, Ralph Castain wrote: > So

Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC

2014-08-08 Thread Paul Hargrove
I will attempt to confirm on my Solaris-10 system ASAP. That will allow me to finally be certain that the other static linking issue has been resolved. -Paul On Fri, Aug 8, 2014 at 11:39 AM, Jeff Squyres (jsquyres) wrote: > Thanks! > > On Aug 8, 2014, at 2:30 PM, George Bosilca wrote: > > > r

Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC

2014-08-08 Thread Jeff Squyres (jsquyres)
Thanks! On Aug 8, 2014, at 2:30 PM, George Bosilca wrote: > r32467 should fix the problem. > > George. > > > On Fri, Aug 8, 2014 at 1:20 PM, Jeff Squyres (jsquyres) > wrote: > That'll do it... > > George: can you fix? > > > On Aug 8, 2014, at 1:11 PM, Ralph Castain wrote: > > > I thi

Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC

2014-08-08 Thread George Bosilca
r32467 should fix the problem. George. On Fri, Aug 8, 2014 at 1:20 PM, Jeff Squyres (jsquyres) wrote: > That'll do it... > > George: can you fix? > > > On Aug 8, 2014, at 1:11 PM, Ralph Castain wrote: > > > I think it might be getting pulled in from this include: > > > > opal/mca/common/sm/

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread George Bosilca
On Fri, Aug 8, 2014 at 5:21 AM, Ralph Castain wrote: > Sorry to chime in a little late. George is likely correct about using > ORTE_NAME, only you can't do that as the OPAL layer has no idea what that > datatype looks like. This was the original reason for creating the > opal_identifier_t type -

Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC

2014-08-08 Thread Jeff Squyres (jsquyres)
That'll do it... George: can you fix? On Aug 8, 2014, at 1:11 PM, Ralph Castain wrote: > I think it might be getting pulled in from this include: > > opal/mca/common/sm/common_sm.h:37:#include "ompi/group/group.h" > > > On Aug 8, 2014, at 5:33 AM, Jeff Squyres (jsquyres) > wrote: > >> We

Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC

2014-08-08 Thread Ralph Castain
I think it might be getting pulled in from this include: opal/mca/common/sm/common_sm.h:37:#include "ompi/group/group.h" On Aug 8, 2014, at 5:33 AM, Jeff Squyres (jsquyres) wrote: > Weirdness; I don't see any name like that in the SM BTL. > > I see it used in the OMPI layer... not sure how it

Re: [OMPI devel] Open MPI SVN -> Git (github) conversion

2014-08-08 Thread Jeff Squyres (jsquyres)
Done; thanks. On Aug 8, 2014, at 11:05 AM, Tim Mattox wrote: > Jeff, > I may someday again be working for an organization that is an Open MPI > contributor... so could you > update my e-mail address in the authors.txt file to be "timattox = Tim Mattox > " > Thanks! > > > On Fri, Aug 8, 2014

Re: [OMPI devel] ibm abort test hangs on one node

2014-08-08 Thread Ralph Castain
Committed a fix for this in r32460 - see if I got it! On Aug 8, 2014, at 4:02 AM, Gilles Gouaillardet wrote: > Folks, > > here is the description of a hang i briefly mentioned a few days ago. > > with the trunk (i did not check 1.8 ...) simply run on one node : > mpirun -np 2 --mca btl sm,se

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Ralph Castain
Committed a fix for this in r32459 - please check and see if this resolves the issue. On Aug 8, 2014, at 2:21 AM, Ralph Castain wrote: > Sorry to chime in a little late. George is likely correct about using > ORTE_NAME, only you can't do that as the OPAL layer has no idea what that > datatyp

Re: [OMPI devel] jenkins error in trunk

2014-08-08 Thread Ralph Castain
Fixed in r32462 On Aug 8, 2014, at 8:13 AM, Mike Dubman wrote: > > Josh,Devendar - could you please take a look? > Thanks > > 15:45:00 Making install in mca/coll/fca > 15:45:00 make[2]: Entering directory > `/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi/mca/coll/fca'

[OMPI devel] jenkins error in trunk

2014-08-08 Thread Mike Dubman
Josh, Devendar - could you please take a look? Thanks 15:45:00 Making install in mca/coll/fca 15:45:00 make[2]: Entering directory `/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi/mca/coll/fca' 15:45:00 CC coll_fca_module.lo 15:45:00 coll_fca_module.c: In

Re: [OMPI devel] Open MPI SVN -> Git (github) conversion

2014-08-08 Thread Tim Mattox
Jeff, I may someday again be working for an organization that is an Open MPI contributor... so could you update my e-mail address in the authors.txt file to be "timattox = Tim Mattox " Thanks! On Fri, Aug 8, 2014 at 11:00 AM, Jeff Squyres (jsquyres) wrote: > SHORT VERSION > = > > Pl

[OMPI devel] Open MPI SVN -> Git (github) conversion

2014-08-08 Thread Jeff Squyres (jsquyres)
SHORT VERSION: Please verify/update the email address that you'd like me to use for your Open MPI commits when we do the git conversion: https://github.com/open-mpi/authors Updates are due by COB Friday, 15 Aug, 2014 (1 week from today). MORE DETAIL: Dave and I are

[OMPI devel] errors and warnings with show_help() usage

2014-08-08 Thread Jeff Squyres (jsquyres)
SHORT VERSION: The ./contrib/check-help-strings.pl script is showing ***47 coding errors*** with regard to using show_help() in components. Here's a summary of the offenders: - ORTE (lumped together because there's a single maintainer :-) ) - smcuda and cuda - common/verbs - bcol

Re: [OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC

2014-08-08 Thread Jeff Squyres (jsquyres)
Weirdness; I don't see any name like that in the SM BTL. I see it used in the OMPI layer... not sure how it's being used down in the btl SM component file...? On Aug 7, 2014, at 11:25 PM, Paul Hargrove wrote: > Testing r32448 on trunk for trac issue #4834, I encounter the following which >

[OMPI devel] ibm abort test hangs on one node

2014-08-08 Thread Gilles Gouaillardet
Folks, here is the description of a hang i briefly mentioned a few days ago. with the trunk (i did not check 1.8 ...) simply run on one node : mpirun -np 2 --mca btl sm,self ./abort (the abort test is taken from the ibm test suite : process 0 calls MPI_Abort while process 1 enters an infinite lo

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Ralph Castain
Sorry to chime in a little late. George is likely correct about using ORTE_NAME, only you can't do that as the OPAL layer has no idea what that datatype looks like. This was the original reason for creating the opal_identifier_t type - I had no other choice when we moved the db framework (now d

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Gilles Gouaillardet
George, (one of the) faulty lines was : if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL, OPAL_DB_LOCALLDR, (opal_identifier_t*)&proc, OPAL_ID_T))) { so if proc is not 64-bit aligned, a SIGBUS will occur on sparc. as you point

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread George Bosilca
This is a gigantic patch for an almost trivial issue. The current problem is purely related to the fact that in a single location (nidmap.c) the orte_process_name_t (which is a structure of 2 integers) is assumed to be aligned based on the uint64_t requirements. Bad assumption! Looking at the cod
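
[Editor's illustration] The alignment problem described in this message can be sketched in a few lines of C. This is a minimal sketch under stated assumptions, not the fix that was committed: example_name_t is a hypothetical stand-in for orte_process_name_t, and the two helper functions are invented for illustration. It shows why a struct of two 32-bit integers cannot safely be read through a 64-bit pointer, and one portable way to convert it with memcpy instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for orte_process_name_t: two 32-bit fields,
 * so the compiler only guarantees 4-byte alignment for the struct. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} example_name_t;

/* Unsafe pattern (shown for contrast, never executed here):
 *
 *     uint64_t id = *(const uint64_t *)name;
 *
 * This assumes the struct sits on an 8-byte boundary. On strict-alignment
 * CPUs such as SPARC, a 4-byte-aligned address raises SIGBUS. */

/* Safe pattern: copy the bytes into a properly aligned uint64_t.
 * (Hypothetical helper name, assumed for this sketch.) */
static uint64_t example_name_to_id(const example_name_t *name)
{
    uint64_t id;
    assert(sizeof(id) == sizeof(*name));
    memcpy(&id, name, sizeof(id));
    return id;
}

/* Inverse conversion, again byte-wise so no alignment is assumed. */
static example_name_t example_id_to_name(uint64_t id)
{
    example_name_t name;
    memcpy(&name, &id, sizeof(name));
    return name;
}
```

Compilers typically turn a fixed-size memcpy like this into a pair of aligned loads and stores, so the portable form usually costs little compared with the cast.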

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Gilles, I applied your patch to v1.8 and it ran successfully on my SPARC machines. Takahiro Kawashima, MPI development team, Fujitsu > Kawashima-san and all, > > Here is attached a one off patch for v1.8. > /* it does not use the __attribute__ modifier that might not be > supported by all compi

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Gilles Gouaillardet
Kawashima-san and all, Here is attached a one off patch for v1.8. /* it does not use the __attribute__ modifier that might not be supported by all compilers */ as far as i am concerned, the same issue is also in the trunk, and if you do not hit it, it just means you are lucky :-) the same issue

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Gilles, George, The problem is the one Gilles pointed out. I temporarily modified the code below and the bus error disappeared. --- orte/util/nidmap.c (revision 32447) +++ orte/util/nidmap.c (working copy) @@ -885,7 +885,7 @@ orte_proc_state_t state; orte_app_idx_t app_idx; int32_t

Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value

2014-08-08 Thread George Bosilca
Paul's tests identified a small issue with the previous patch (a real corner-case for ARM v5). The patch below fixes all known issues. Btw, there is still room for volunteers for the .asm work. George. On Tue, Aug 5, 2014 at 2:23 PM, George Bosilca wrote: > Thanks to Paul help all the
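
[Editor's illustration] For readers unfamiliar with the primitive named in this thread's subject, the following is a minimal sketch of a compare-and-swap that returns the old value, written with C11 <stdatomic.h>. It is an assumption-laden illustration only: example_cas_fetch_32 is a hypothetical name, and this is not the patch or the Open MPI atomics API discussed in the RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdatomic.h>

/* Hypothetical helper: atomically replace *addr with `desired` if it
 * currently equals `expected`, and return the value *addr held before
 * the operation either way. */
static int32_t example_cas_fetch_32(_Atomic int32_t *addr,
                                    int32_t expected,
                                    int32_t desired)
{
    assert(addr != NULL);
    /* On failure, atomic_compare_exchange_strong stores the value it
     * observed into `expected`. On success, `expected` is untouched and
     * already equals the old value. So `expected` holds the old value
     * in both cases. */
    atomic_compare_exchange_strong(addr, &expected, desired);
    return expected;
}
```

The caller can tell whether the swap happened by comparing the return value with the expected value it passed in, which avoids a separate load after a failed attempt.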

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Hi George, > Takahiro you can confirm this by printing the value of data when signal is > raised. It's in the trace. 0x07fede74 #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 "opal.local.ldr",data=(void *) 0x

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Gilles Gouaillardet
Kawashima-san, This is interesting :-) proc is in the stack and has type orte_process_name_t with typedef uint32_t orte_jobid_t; typedef uint32_t orte_vpid_t; struct orte_process_name_t { orte_jobid_t jobid; /**< Job number */ orte_vpid_t vpid; /**< Process id - equivalent to

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread George Bosilca
I have an extremely vague recollection about a similar issue in the datatype engine: on the SPARC architecture the 64-bit integers must be aligned on a 64-bit boundary or you get a bus error. Takahiro you can confirm this by printing the value of data when signal is raised. George. On Fri, Au

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Hi, > > >>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris > > >>> 10 Sparc and I receive a bus error, if I run a small program. I've finally reproduced the bus error in my SPARC environment. #0 0x00db4740 (__waitpid_nocancel + 0x44) (0x200,0x0,0x0,0xa0,0xf80100064af0,0