Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Hi,

> > >>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
> > >>> 10 Sparc and I receive a bus error, if I run a small program.

I've finally reproduced the bus error in my SPARC environment.

#0 0x00db4740 (__waitpid_nocancel + 0x44) (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4)
#1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in ../sigattach.c
#2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line 252 in db_hash.c
#3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line 49 in db_base_fns.c
#4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) 0x00281d70) at line 975 in nidmap.c
#5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c
#6 0x01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c
#7 0x00f9f28c (orte_init + 0x308) (pargc=(int *) 0x,pargv=(char ***) 0x,flags=32) at line 148 in orte_init.c
#8 0x001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) 0x07fef348,requested=0,provided=(int *) 0x07fee698) at line 464 in ompi_mpi_init.c
#9 0x001ff79c (MPI_Init + 0x2b0) (argc=(int *) 0x07fee814,argv=(char ***) 0x07fee818) at line 84 in init.c
#10 0x00100ae4 (main + 0x44) (argc=1,argv=(char **) 0x07fef348) at line 8 in mpiinitfinalize.c
#11 0x00d2b81c (__libc_start_main + 0x194) (0x100aa0,0x1,0x7fef348,0x100d24,0x100d14,0x0)
#12 0x0010094c (_start + 0x2c) ()

Line 252 in opal/mca/db/hash/db_hash.c is:

    case OPAL_UINT64:
        if (NULL == data) {
            OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM);
            return OPAL_ERR_BAD_PARAM;
        }
        kv->type = OPAL_UINT64;
        kv->data.uint64 = *(uint64_t*)(data); // !!! here !!!
        break;
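
For illustration only, here is a minimal sketch of one common way to read a 64-bit value from a pointer that may be only 4-byte aligned: copy the bytes with memcpy instead of dereferencing a cast pointer. The helper name load_u64_unaligned is hypothetical and is not part of Open MPI, and this is not necessarily how the issue was fixed in the tree.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not an Open MPI function: read a uint64_t from a
 * pointer that may not satisfy uint64_t alignment. memcpy has no alignment
 * requirement on its arguments, so no misaligned 64-bit load is issued
 * on behalf of the caller. */
static uint64_t load_u64_unaligned(const void *data)
{
    uint64_t value;

    assert(NULL != data);
    memcpy(&value, data, sizeof(value));
    return value;
}
```

With such a helper, the failing assignment could be written as kv->data.uint64 = load_u64_unaligned(data); most compilers turn the memcpy into a plain load on platforms where that is safe.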

My environment is:

  Open MPI v1.8 branch r32447 (latest)
  configure --enable-debug
  SPARC-V9 (Fujitsu SPARC64 IXfx)
  Linux (custom)
  gcc 4.2.4

I could not reproduce it with the Open MPI trunk or with the Fujitsu compiler.

Does this information help?

Takahiro Kawashima,
MPI development team,
Fujitsu

> Hi,
> 
> I'm sorry to answer late once more, but our mail server was down for
> the last two days (hardware error).
> 
> > Did you configure this with --enable-debug?
> 
> Yes, I used the following command.
> 
> ../openmpi-1.8.2rc3/configure --prefix=/usr/local/openmpi-1.8.2_64_gcc \
>   --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \
>   --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
>   --with-jdk-headers=/usr/local/jdk1.8.0/include \
>   JAVA_HOME=/usr/local/jdk1.8.0 \
>   LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \
>   CC="gcc" CXX="g++" FC="gfortran" \
>   CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \
>   CPP="cpp" CXXCPP="cpp" \
>   CPPFLAGS="" CXXCPPFLAGS="" \
>   --enable-mpi-cxx \
>   --enable-cxx-exceptions \
>   --enable-mpi-java \
>   --enable-heterogeneous \
>   --enable-mpi-thread-multiple \
>   --with-threads=posix \
>   --with-hwloc=internal \
>   --without-verbs \
>   --with-wrapper-cflags="-std=c11 -m64" \
>   --enable-debug \
>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc
> 
> 
> 
> > If so, you should get a line number in the backtrace
> 
> I got them for gdb (see below), but not for "dbx".
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> 
> > 
> > 
> > On Aug 5, 2014, at 2:59 AM, Siegmar Gross wrote:
> > 
> > > Hi,
> > > 
> > > I'm sorry to answer so late, but last week I didn't have Internet
> > > access. In the meantime I've installed openmpi-1.8.2rc3 and I get
> > > the same error.
> > > 
> > >> This looks like the typical type of alignment error that we used
> > >> to see when testing regularly on SPARC.  :-\
> > >> 
> > >> It looks like the error was happening in mca_db_hash.so.  Could
> > >> you get a stack trace / file+line number where it was failing
> > >> in mca_db_hash?  (i.e., the actual bad code will likely be under
> > >> opal/mca/db/hash somewhere)
> > > 
> > > Unfortunately I don't get a file+line number from a file in
> > > opal/mca/db/hash.
> > > 
> > > 
> > > 
> > > tyr small_prog 102 ompi_info | grep MPI:
> > > Open MPI: 1.8.2rc3
> > > tyr small_prog 103 which mpicc
> > > /usr/local/openmpi-1.8.2_64_gcc/bin/mpicc
> > > tyr small_prog 104 mpicc init_finalize.c 
> > > tyr small_prog 106 /opt/solstudio12.3/bin/sparcv9/dbx /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec 
> > > For information about new features see `help changes'
> > > To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
> > > Reading mp

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread George Bosilca
I have an extremely vague recollection of a similar issue in the
datatype engine: on the SPARC architecture, 64-bit integers must be
aligned on a 64-bit boundary or you get a bus error.

Takahiro, you can confirm this by printing the value of data when the
signal is raised.

George.




Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Gilles Gouaillardet
Kawashima-san,

This is interesting :-)

proc is on the stack and has type orte_process_name_t

with

typedef uint32_t orte_jobid_t;
typedef uint32_t orte_vpid_t;
struct orte_process_name_t {
    orte_jobid_t jobid; /**< Job number */
    orte_vpid_t vpid;   /**< Process id - equivalent to rank */
};
typedef struct orte_process_name_t orte_process_name_t;


so there is really no reason to align this on 8 bytes...
but later, proc is cast to a uint64_t ...
so proc should have been aligned on 8 bytes, but by then it is too late,
hence the glorious SIGBUS
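
To make the size/alignment mismatch concrete, here is a small C11 sketch. The struct name name_pair is a hypothetical stand-in for the layout quoted above, not the real Open MPI type, and the exact alignment values depend on the ABI (typically 4 for the struct and 8 for uint64_t on 64-bit targets).

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a process name made of two 32-bit fields. */
struct name_pair {
    uint32_t jobid;
    uint32_t vpid;
};

/* The struct is as large as a uint64_t ... */
static_assert(sizeof(struct name_pair) == sizeof(uint64_t),
              "two uint32_t fields occupy 8 bytes");

/* ... but it only needs the alignment of its widest member. */
static size_t name_pair_alignment(void)
{
    return alignof(struct name_pair);
}

/* Alignment the compiler requires for a uint64_t object. */
static size_t u64_alignment(void)
{
    return alignof(uint64_t);
}
```

Whenever name_pair_alignment() is smaller than u64_alignment(), a stack variable of the struct type may land on an address that is valid for the struct but not for a 64-bit load.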


this is loosely related to
http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
(see heterogeneous.v2.patch)
if we make opal_process_name_t a union of a uint64_t and a struct of two
uint32_t, the compiler will align it on 8 bytes.
note the patch is not enough (and will not apply on the v1.8 branch anyway).
we could simply remove orte_process_name_t and ompi_process_name_t and
use only opal_process_name_t (and never declare variables with type
opal_proc_name_t, otherwise alignment might be incorrect)

as a workaround, you can declare an opal_process_name_t (for alignment),
and cast it to an orte_process_name_t
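
A rough sketch of the union idea follows. The type and function names (pair_name, aligned_name, pack_name) are hypothetical and are not the actual Open MPI definitions; the point is only that a union containing a uint64_t member inherits its alignment.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical two-field name, mirroring the layout discussed above. */
struct pair_name {
    uint32_t jobid;
    uint32_t vpid;
};

/* Hypothetical union: the uint64_t member forces 64-bit alignment on
 * every object of this type, wherever it is declared. */
union aligned_name {
    uint64_t as_u64;
    struct pair_name as_pair;
};

static_assert(alignof(union aligned_name) >= alignof(uint64_t),
              "union is at least as aligned as uint64_t");

/* Fill in the two fields, then read the same 8 bytes as one integer.
 * The resulting value depends on the byte order of the machine. */
static uint64_t pack_name(uint32_t jobid, uint32_t vpid)
{
    union aligned_name n;

    n.as_pair.jobid = jobid;
    n.as_pair.vpid = vpid;
    return n.as_u64;
}
```

Reading a union member other than the one last written is permitted in C, so this avoids both the alignment problem and the pointer cast.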

i will write a patch (i will not be able to test on sparc ...)
please note this issue might be present in other places

Cheers,

Gilles


Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Hi George,

> Takahiro you can confirm this by printing the value of data when signal is
> raised.

It's in the trace.
0x07fede74

#2 0x0282aff4 (store + 0x540) (uid=(unsigned long *) 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8 "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line 252 in db_hash.c
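
Taking the address exactly as printed in the trace, its low byte is 0x74, so it is a multiple of 4 but not of 8. A minimal sketch of that check (is_aligned is a hypothetical helper name):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns nonzero if addr is a multiple of align.
 * align must be a nonzero power of two. */
static int is_aligned(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}
```

Here is_aligned(0x07fede74, 4) is nonzero while is_aligned(0x07fede74, 8) is zero, which matches a 4-byte-aligned struct being read through a uint64_t pointer.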

I want to dig into this issue, but unfortunately I have no time today.
My SPARC machines will be shut down for maintenance in an hour...

Takahiro Kawashima,
MPI development team,
Fujitsu


Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Gilles, George,

The problem is the one Gilles pointed out.
I temporarily modified the code as shown below and the bus error disappeared.

--- orte/util/nidmap.c  (revision 32447)
+++ orte/util/nidmap.c  (working copy)
@@ -885,7 +885,7 @@
 orte_proc_state_t state;
 orte_app_idx_t app_idx;
 int32_t restarts;
-orte_process_name_t proc, dmn;
+orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
 char *hostname;
 uint8_t flag;
 opal_buffer_t *bptr;
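
As a side note, __attribute__((__aligned__(8))) is a GCC extension. Assuming a C11-capable compiler (which may not hold for the gcc 4.2.4 used here), the same per-variable alignment can be sketched with alignas. The struct and function names below are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical stand-in for a name made of two 32-bit fields. */
struct two_u32 {
    uint32_t jobid;
    uint32_t vpid;
};

/* Declare a local with uint64_t alignment and report whether its address
 * really is a multiple of that alignment. */
static int aligned_local_is_u64_aligned(void)
{
    alignas(uint64_t) struct two_u32 proc = { 1, 2 };

    return ((uintptr_t)&proc % alignof(uint64_t)) == 0;
}
```

This only fixes the one variable it is applied to; every other declaration of the type keeps its natural 4-byte alignment.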

Takahiro Kawashima,
MPI development team,
Fujitsu


Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Gilles Gouaillardet
Kawashima-san and all,

Attached is a one-off patch for v1.8.
/* it does not use the __attribute__ modifier, which might not be
supported by all compilers */

as far as i am concerned, the same issue is also in the trunk,
and if you do not hit it, it just means you are lucky :-)

the same issue might also be in other parts of the code :-(

Cheers,

Gilles


Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Kawashima, Takahiro
Gilles,

I applied your patch to v1.8 and it ran successfully
on my SPARC machines.

Takahiro Kawashima,
MPI development team,
Fujitsu


Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread George Bosilca
This is a gigantic patch for an almost trivial issue. The current problem
is purely related to the fact that in a single location (nidmap.c) the
orte_process_name_t (which is a structure of 2 integers) is assumed to be
aligned based on the uint64_t requirements. Bad assumption!

Looking at the code one might notice that the orte_process_name_t is stored
using a particular DSS type, OPAL_ID_T. This is a shortcut that doesn't hold
on the SPARC architecture because the two types (int32_t and int64_t) have
different alignments.  However, ORTE defines a type for orte_process_name_t.
Thus, I think that if, instead of saving the orte_process_name_t as an
OPAL_ID_T, we save it as an ORTE_NAME, the issue will go away.
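
The alignment difference can be made visible with a small C11 sketch. The
types name_t and aligned_name_t below are hypothetical stand-ins (they are
not the real ORTE/OPAL definitions); name_t only mirrors the two-uint32_t
layout quoted in this thread, and aligned_name_t mirrors the union idea
mentioned earlier. The exact numbers depend on the ABI, so treat this as an
illustration, not as a statement about any particular platform.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in with the same layout as orte_process_name_t. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} name_t;

/* Hypothetical wrapper: the uint64_t member raises the required alignment. */
typedef union {
    uint64_t id;
    name_t   name;
} aligned_name_t;

/* Alignment the compiler guarantees for objects of each type. */
size_t name_alignment(void)         { return alignof(name_t); }
size_t aligned_name_alignment(void) { return alignof(aligned_name_t); }
size_t u64_alignment(void)          { return alignof(uint64_t); }
```

On an ABI where uint64_t needs 8-byte alignment and uint32_t needs 4, a
name_t on the stack may land on an address that is only a multiple of 4,
which is exactly the case that a cast to uint64_t* cannot tolerate.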

  George.



On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

> Kawashima-san and all,
>
> Here is attached a one off patch for v1.8.
> /* it does not use the __attribute__ modifier that might not be
> supported by all compilers */
>
> as far as i am concerned, the same issue is also in the trunk,
> and if you do not hit it, it just means you are lucky :-)
>
> the same issue might also be in other parts of the code :-(
>
> Cheers,
>
> Gilles
>
> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
> > Gilles, George,
> >
> > The problem is the one Gilles pointed.
> > I temporarily modified the code bellow and the bus error disappeared.
> >
> > --- orte/util/nidmap.c  (revision 32447)
> > +++ orte/util/nidmap.c  (working copy)
> > @@ -885,7 +885,7 @@
> >  orte_proc_state_t state;
> >  orte_app_idx_t app_idx;
> >  int32_t restarts;
> > -orte_process_name_t proc, dmn;
> > +orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
> >  char *hostname;
> >  uint8_t flag;
> >  opal_buffer_t *bptr;
> >
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> >
> >> Kawashima-san,
> >>
> >> This is interesting :-)
> >>
> >> proc is in the stack and has type orte_process_name_t
> >>
> >> with
> >>
> >> typedef uint32_t orte_jobid_t;
> >> typedef uint32_t orte_vpid_t;
> >> struct orte_process_name_t {
> >> orte_jobid_t jobid; /**< Job number */
> >> orte_vpid_t vpid;   /**< Process id - equivalent to rank */
> >> };
> >> typedef struct orte_process_name_t orte_process_name_t;
> >>
> >>
> >> so there is really no reason to align this on 8 bytes...
> >> but later, proc is casted into an uint64_t ...
> >> so proc should have been aligned on 8 bytes but it is too late,
> >> and hence the glory SIGBUS
> >>
> >>
> >> this is loosely related to
> >> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
> >> (see heterogeneous.v2.patch)
> >> if we make opal_process_name_t an union of uint64_t and a struct of two
> >> uint32_t, the compiler
> >> will align this on 8 bytes.
> >> note the patch is not enough (and will not apply on the v1.8 branch
> anyway),
> >> we could simply remove orte_process_name_t and ompi_process_name_t and
> >> use only
> >> opal_process_name_t (and never declare variables with type
> >> opal_proc_name_t otherwise alignment might be incorrect)
> >>
> >> as a workaround, you can declare an opal_process_name_t (for alignment),
> >> and cast it to an orte_process_name_t
> >>
> >> i will write a patch (i will not be able to test on sparc ...)
> >> please note this issue might be present in other places
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On 2014/08/08 13:03, Kawashima, Takahiro wrote:
> >>> Hi,
> >>>
> >>>>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
> >>>>>>>> 10 Sparc and I receive a bus error, if I run a small program.
> >>> I've finally reproduced the bus error in my SPARC environment.
> >>>
> >>> #0 0x00db4740 (__waitpid_nocancel + 0x44)
> (0x200,0x0,0x0,0xa0,0xf80100064af0,0x35b4)
> >>> #1 0x0001a310 (handle_signal + 0x574) (signo=10,info=(struct
> siginfo *) 0x07fed100,p=(void *) 0x07fed100) at line 277 in
> ../sigattach.c 
> >>> #2 0x0282aff4 (store + 0x540) (uid=(unsigned long *)
> 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8
> "opal.local.ldr",data=(void *) 0x07fede74,type=15:'\017') at line
> 252 in db_hash.c
> >>> #3 0x01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long
> *) 0x0118a128,scope=8:'\b',key=(char *) 0x0106a0a8
> "opal.local.ldr",object=(void *) 0x07fede74,type=15:'\017') at line
> 49 in db_base_fns.c
> >>> #4 0x00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *)
> 0x00281d70) at line 975 in nidmap.c
> >>> #5 0x00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct
> opal_buffer_t *) 0x00241fc0) at line 141 in nidmap.c
> >>> #6 0x01e298cc (rte_init + 0x2a0) () at line 153 in
> ess_env_module.c
> >>> #7 0x00f9f28c (orte_init + 0x308) (pargc=(int *)
> 0x,pargv=(char ***) 0x,flags=32) at line
> 148 in orte_init.c
> >>> #8 0x001a6f08 (ompi_mpi_ini

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Gilles Gouaillardet
George,

(one of the) faulty lines was:

    if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME,
                                            OPAL_SCOPE_INTERNAL, OPAL_DB_LOCALLDR,
                                            (opal_identifier_t*)&proc, OPAL_ID_T))) {

so if proc is not 64-bit aligned, a SIGBUS will occur on SPARC.
as you pointed out, replacing OPAL_ID_T with ORTE_NAME will very likely fix
the issue (i have no SPARC machine to test on...)

i was initially also "confused" by the following line:

    if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc,
                                            OPAL_SCOPE_INTERNAL, ORTE_DB_NPROC_OFFSET,
                                            &offset, OPAL_UINT32))) {

the first argument of store is an (opal_identifier_t *).
strictly speaking this is "a pointer to a 64-bit aligned address", and
proc might not be 64-bit aligned.
/* that being said, there is no crash :-) */

in this case, the opal_db.store pointer points to the store function
(db_hash.c:178), and proc is only used in a memcpy at line 194, so 64-bit
alignment is not required.
(and the comment is explicit: /* to protect alignment, copy the data across */)

that might sound pedantic, but are we doing the right thing here?
(e.g. cast to (opal_identifier_t *), followed by a memcpy in case the
pointer was not 64-bit aligned, vs. always using aligned data?)
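
For the "always use aligned data" option, here is a minimal sketch assuming
a C11 compiler: alignas from <stdalign.h> requests the same alignment as the
__attribute__((__aligned__(8))) in the temporary patch, but in standard
syntax. name_t, is_u64_aligned and demo_aligned_local are hypothetical names
used only for illustration; older compilers may not accept alignas at all.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical stand-in with the same layout as orte_process_name_t. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} name_t;

/* Returns 1 if p is suitably aligned for a uint64_t access, 0 otherwise. */
int is_u64_aligned(const void *p)
{
    return ((uintptr_t)p % alignof(uint64_t)) == 0;
}

/* Declares a local name_t with the alignment of uint64_t and checks it. */
int demo_aligned_local(void)
{
    alignas(uint64_t) name_t proc = { 1u, 2u };
    return is_u64_aligned(&proc);
}
```

The request applies to this one variable only; any other name_t declared
without alignas keeps the weaker alignment of its type.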

Cheers,

Gilles

On 2014/08/08 14:58, George Bosilca wrote:
> This is a gigantic patch for an almost trivial issue. The current problem
> is purely related to the fact that in a single location (nidmap.c) the
> orte_process_name_t (which is a structure of 2 integers) is supposed to be
> aligned based on the uint64_t requirements. Bad assumption!
>
> Looking at the code one might notice that the orte_process_name_t is stored
> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold
> on the SPARC architecture because the two types (int32_t and int64_t) have
> different alignments.  However, ORTE define a type for orte_process_name_t.
> Thus, I think that if instead of saving the orte_process_name_t as an
> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
>
>   George.
>
>
>
> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>> Kawashima-san and all,
>>
>> Here is attached a one off patch for v1.8.
>> /* it does not use the __attribute__ modifier that might not be
>> supported by all compilers */
>>
>> as far as i am concerned, the same issue is also in the trunk,
>> and if you do not hit it, it just means you are lucky :-)
>>
>> the same issue might also be in other parts of the code :-(
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
>>> Gilles, George,
>>>
>>> The problem is the one Gilles pointed.
>>> I temporarily modified the code bellow and the bus error disappeared.
>>>
>>> --- orte/util/nidmap.c  (revision 32447)
>>> +++ orte/util/nidmap.c  (working copy)
>>> @@ -885,7 +885,7 @@
>>>  orte_proc_state_t state;
>>>  orte_app_idx_t app_idx;
>>>  int32_t restarts;
>>> -orte_process_name_t proc, dmn;
>>> +orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
>>>  char *hostname;
>>>  uint8_t flag;
>>>  opal_buffer_t *bptr;
>>>
>>> Takahiro Kawashima,
>>> MPI development team,
>>> Fujitsu
>>>
>>>> Kawashima-san,
>>>>
>>>> This is interesting :-)
>>>>
>>>> proc is in the stack and has type orte_process_name_t
>>>>
>>>> with
>>>>
>>>> typedef uint32_t orte_jobid_t;
>>>> typedef uint32_t orte_vpid_t;
>>>> struct orte_process_name_t {
>>>> orte_jobid_t jobid; /**< Job number */
>>>> orte_vpid_t vpid;   /**< Process id - equivalent to rank */
>>>> };
>>>> typedef struct orte_process_name_t orte_process_name_t;
>>>>
>>>>
>>>> so there is really no reason to align this on 8 bytes...
>>>> but later, proc is casted into an uint64_t ...
>>>> so proc should have been aligned on 8 bytes but it is too late,
>>>> and hence the glory SIGBUS
>>>>
>>>>
>>>> this is loosely related to
>>>> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
>>>> (see heterogeneous.v2.patch)
>>>> if we make opal_process_name_t an union of uint64_t and a struct of two
>>>> uint32_t, the compiler
>>>> will align this on 8 bytes.
>>>> note the patch is not enough (and will not apply on the v1.8 branch
>>>> anyway),
>>>> we could simply remove orte_process_name_t and ompi_process_name_t and
>>>> use only
>>>> opal_process_name_t (and never declare variables with type
>>>> opal_proc_name_t otherwise alignment might be incorrect)
>>>>
>>>> as a workaround, you can declare an opal_process_name_t (for alignment),
>>>> and cast it to an orte_process_name_t
>>>>
>>>> i will write a patch (i will not be able to test on sparc ...)
>>>> please note this issue might be present in other places
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/08/08 13:03, Kawashima, Takahiro wrote:
>>>>> Hi,
>>>>>
>>>>>>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
>>>>>>>>>> 10 Sparc and I receive a 

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Ralph Castain
Sorry to chime in a little late. George is likely correct about using 
ORTE_NAME, only you can't do that as the OPAL layer has no idea what that 
datatype looks like. This was the original reason for creating the 
opal_identifier_t type - I had no other choice when we moved the db framework 
(now dstore) to the OPAL layer in anticipation of the BTLs moving to OPAL. The 
abstraction requirement wouldn't allow me to pass down the structure definition.

The easiest solution is probably to change the opal/db/hash code so that 64-bit 
fields are memcpy'd instead of simply assigned with "=". This should eliminate 
the problem with the least fuss.
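
As a rough illustration of that change (a sketch only, not the committed
code): value_t, store_u64 and roundtrip_misaligned are hypothetical names,
and value_t is a simplified stand-in for the value slot in the hash store.
The point is that memcpy copies byte by byte as far as the language is
concerned, so the source pointer needs no particular alignment.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the value slot in the hash store. */
typedef struct {
    uint64_t uint64;
} value_t;

/* Copy 8 bytes from data without assuming that data is 8-byte aligned. */
void store_u64(value_t *kv, const void *data)
{
    memcpy(&kv->uint64, data, sizeof(kv->uint64));
}

/* Place v at a deliberately odd offset and read it back via store_u64. */
uint64_t roundtrip_misaligned(uint64_t v)
{
    unsigned char buf[sizeof(uint64_t) + 1];
    value_t kv;

    memcpy(buf + 1, &v, sizeof v);  /* buf + 1 is generally not 8-byte aligned */
    store_u64(&kv, buf + 1);
    return kv.uint64;
}
```

With a direct dereference such as *(uint64_t*)(data), the same misaligned
source would be undefined behaviour in C and traps on strict-alignment CPUs.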

There is a performance penalty for using non-aligned data, and ideally we 
should use aligned data whenever possible. This code isn't in the critical path 
and so this is less of an issue, but still would be nice to do. However, I 
didn't do so for the following reasons:

* I couldn't find a way for the compiler to check/require alignment down in 
opal_db.store when passed a parameter. If someone knows of a way to do that, 
please feel free to suggest it

* none of our current developers have access to a Solaris SPARC machine, and 
thus our developers cannot detect violations when they occur

* the current solution avoids the issue, albeit with a slight performance 
penalty
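
One partial option for the first point, offered only as a sketch under
assumptions: C11 static_assert can reject an under-aligned *type* at the call
site, before the pointer is cast. It cannot check an arbitrary pointer value,
and it cannot help inside the callee once the cast has happened.
REQUIRE_U64_ALIGNED, name_t, aligned_name_t and pack_name are hypothetical
names, not part of OPAL or ORTE.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignof (C11) */
#include <stdint.h>

/* Hypothetical stand-ins, not the real OPAL/ORTE definitions. */
typedef struct { uint32_t jobid; uint32_t vpid; } name_t;
typedef union  { uint64_t id; name_t name; } aligned_name_t;

/* Fails the build if TYPE is less strictly aligned than uint64_t. */
#define REQUIRE_U64_ALIGNED(TYPE) \
    static_assert(alignof(TYPE) >= alignof(uint64_t), \
                  #TYPE " must be at least as aligned as uint64_t")

REQUIRE_U64_ALIGNED(aligned_name_t);  /* accepted */
/* REQUIRE_U64_ALIGNED(name_t); */    /* rejected where alignof(name_t) < alignof(uint64_t) */

/* Reading the 64-bit view through the union needs no pointer cast. */
uint64_t pack_name(uint32_t jobid, uint32_t vpid)
{
    aligned_name_t n;
    n.name.jobid = jobid;
    n.name.vpid  = vpid;
    return n.id;
}
```

The check is tied to the declared type, so a variable aligned individually
(for example with an attribute) would still need a separate runtime check.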

I'm open to alternative methods - I'm not happy with the ugliness this 
required, but couldn't come up with a cleaner solution that would make it easy 
for developers to know when they had violated the alignment requirement.

FWIW: it is possible, I suppose, that the other discussion about using an 
opal_process_name_t that exactly mirrors orte_process_name_t could also resolve 
this problem in a cleaner fashion. I didn't impose that requirement here, but 
maybe it's another motivator for doing so?

Ralph


On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet 
 wrote:

> George,
> 
> (one of the) faulty line was :
> 
>if (ORTE_SUCCESS != (rc = 
> opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL,
> OPAL_DB_LOCALLDR, 
> (opal_identifier_t*)&proc, OPAL_ID_T))) {
> 
> so if proc is not 64 bits aligned, a SIGBUS will occur on sparc.
> as you pointed, replacing OPAL_ID_T with ORTE_NAME will very likely fix the 
> issue (i have no arch to test...)
> 
> i was initially also "confused" with the following line
> 
> if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, 
> OPAL_SCOPE_INTERNAL,
> ORTE_DB_NPROC_OFFSET, 
> &offset, OPAL_UINT32))) {
> 
> the first argument of store is an (opal_identifier_t *)
> strictly speaking this is "a pointer to a 64 bits aligned address", and proc 
> might not be 64 bits aligned.
> /* that being said, there is no crash :-) */
> 
> in this case, opal_db.store pointer points to the store function 
> (db_hash.c:178)
> and proc is only used id memcpy at line 194, so 64 bits alignment is not 
> required.
> (and comment is explicit : /* to protect alignment, copy the data across */
> 
> that might sounds pedantic, but are we doing the right thing here ?
> (e.g. cast to (opal_identifier_t *), followed by a memcpy  in case the 
> pointer was not 64 bits aligned
> vs always use aligned data ?)
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/08 14:58, George Bosilca wrote:
>> This is a gigantic patch for an almost trivial issue. The current problem
>> is purely related to the fact that in a single location (nidmap.c) the
>> orte_process_name_t (which is a structure of 2 integers) is supposed to be
>> aligned based on the uint64_t requirements. Bad assumption!
>> 
>> Looking at the code one might notice that the orte_process_name_t is stored
>> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold
>> on the SPARC architecture because the two types (int32_t and int64_t) have
>> different alignments.  However, ORTE define a type for orte_process_name_t.
>> Thus, I think that if instead of saving the orte_process_name_t as an
>> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
>> 
>>   George.
>> 
>> 
>> 
>> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>> 
>>> Kawashima-san and all,
>>> 
>>> Here is attached a one off patch for v1.8.
>>> /* it does not use the __attribute__ modifier that might not be
>>> supported by all compilers */
>>> 
>>> as far as i am concerned, the same issue is also in the trunk,
>>> and if you do not hit it, it just means you are lucky :-)
>>> 
>>> the same issue might also be in other parts of the code :-(
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
>>>> Gilles, George,
>>>>
>>>> The problem is the one Gilles pointed.
>>>> I temporarily modified the code bellow and the bus error disappeared.
>>>>
>>>> --- orte/util/nidmap.c  (revision 32447)
>>>> +++ orte/util/nidmap.c  (working copy)

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Ralph Castain
Committed a fix for this in r32459 - please check and see if this resolves the 
issue.


On Aug 8, 2014, at 2:21 AM, Ralph Castain  wrote:

> Sorry to chime in a little late. George is likely correct about using 
> ORTE_NAME, only you can't do that as the OPAL layer has no idea what that 
> datatype looks like. This was the original reason for creating the 
> opal_identifier_t type - I had no other choice when we moved the db framework 
> (now dstore) to the OPAL layer in anticipation of the BTLs moving to OPAL. 
> The abstraction requirement wouldn't allow me to pass down the structure 
> definition.
> 
> The easiest solution is probably to change the opal/db/hash code so that 
> 64-bit fields are memcpy'd instead of simply passed by "=". This should 
> eliminate the problem with the least fuss.
> 
> There is a performance penalty for using non-aligned data, and ideally we 
> should use aligned data whenever possible. This code isn't in the critical 
> path and so this is less of an issue, but still would be nice to do. However, 
> I didn't do so for the following reasons:
> 
> * I couldn't find a way for the compiler to check/require alignment down in 
> opal_db.store when passed a parameter. If someone knows of a way to do that, 
> please feel free to suggest it
> 
> * none of our current developers have access to a Solaris SPARC machine, and 
> thus our developers cannot detect violations when they occur
> 
> * the current solution avoids the issue, albeit with a slight performance 
> penalty
> 
> I'm open to alternative methods - I'm not happy with the ugliness this 
> required, but couldn't come up with a cleaner solution that would be easy for 
> developers to know when they violated the alignment requirement.
> 
> FWIW: it is possible, I suppose, that the other discussion about using an 
> opal_process_name_t that exactly mirrors orte_process_name_t could also 
> resolve this problem in a cleaner fashion. I didn't impose that requirement 
> here, but maybe it's another motivator for doing so?
> 
> Ralph
> 
> 
> On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet 
>  wrote:
> 
>> George,
>> 
>> (one of the) faulty line was :
>> 
>>if (ORTE_SUCCESS != (rc = 
>> opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL,
>> 
>> OPAL_DB_LOCALLDR, (opal_identifier_t*)&proc, OPAL_ID_T))) {
>> 
>> so if proc is not 64 bits aligned, a SIGBUS will occur on sparc.
>> as you pointed, replacing OPAL_ID_T with ORTE_NAME will very likely fix the 
>> issue (i have no arch to test...)
>> 
>> i was initially also "confused" with the following line
>> 
>> if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, 
>> OPAL_SCOPE_INTERNAL,
>> ORTE_DB_NPROC_OFFSET, 
>> &offset, OPAL_UINT32))) {
>> 
>> the first argument of store is an (opal_identifier_t *)
>> strictly speaking this is "a pointer to a 64 bits aligned address", and proc 
>> might not be 64 bits aligned.
>> /* that being said, there is no crash :-) */
>> 
>> in this case, opal_db.store pointer points to the store function 
>> (db_hash.c:178)
>> and proc is only used id memcpy at line 194, so 64 bits alignment is not 
>> required.
>> (and comment is explicit : /* to protect alignment, copy the data across */
>> 
>> that might sounds pedantic, but are we doing the right thing here ?
>> (e.g. cast to (opal_identifier_t *), followed by a memcpy  in case the 
>> pointer was not 64 bits aligned
>> vs always use aligned data ?)
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/08/08 14:58, George Bosilca wrote:
>>> This is a gigantic patch for an almost trivial issue. The current problem
>>> is purely related to the fact that in a single location (nidmap.c) the
>>> orte_process_name_t (which is a structure of 2 integers) is supposed to be
>>> aligned based on the uint64_t requirements. Bad assumption!
>>> 
>>> Looking at the code one might notice that the orte_process_name_t is stored
>>> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold
>>> on the SPARC architecture because the two types (int32_t and int64_t) have
>>> different alignments.  However, ORTE define a type for orte_process_name_t.
>>> Thus, I think that if instead of saving the orte_process_name_t as an
>>> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
>>> 
>>>   George.
>>> 
>>> 
>>> 
>>> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
>>> gilles.gouaillar...@iferc.org> wrote:
>>> 
>>>> Kawashima-san and all,
>>>>
>>>> Here is attached a one off patch for v1.8.
>>>> /* it does not use the __attribute__ modifier that might not be
>>>> supported by all compilers */
>>>>
>>>> as far as i am concerned, the same issue is also in the trunk,
>>>> and if you do not hit it, it just means you are lucky :-)
>>>>
>>>> the same issue might also be in other parts of the code :-(
>>>>
>>>> Cheers,
>>>>
>>>> Gilles

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread George Bosilca
On Fri, Aug 8, 2014 at 5:21 AM, Ralph Castain  wrote:

> Sorry to chime in a little late. George is likely correct about using
> ORTE_NAME, only you can't do that as the OPAL layer has no idea what that
> datatype looks like. This was the original reason for creating the
> opal_identifier_t type - I had no other choice when we moved the db
> framework (now dstore) to the OPAL layer in anticipation of the BTLs moving
> to OPAL. The abstraction requirement wouldn't allow me to pass down the
> structure definition.
>

We are talking about nidmap.c which has not yet been moved down to OPAL.

  George.


>
> The easiest solution is probably to change the opal/db/hash code so that
> 64-bit fields are memcpy'd instead of simply passed by "=". This should
> eliminate the problem with the least fuss.
>
> There is a performance penalty for using non-aligned data, and ideally we
> should use aligned data whenever possible. This code isn't in the critical
> path and so this is less of an issue, but still would be nice to do.
> However, I didn't do so for the following reasons:
>
> * I couldn't find a way for the compiler to check/require alignment down
> in opal_db.store when passed a parameter. If someone knows of a way to do
> that, please feel free to suggest it
>
> * none of our current developers have access to a Solaris SPARC machine,
> and thus our developers cannot detect violations when they occur
>
> * the current solution avoids the issue, albeit with a slight performance
> penalty
>
> I'm open to alternative methods - I'm not happy with the ugliness this
> required, but couldn't come up with a cleaner solution that would be easy
> for developers to know when they violated the alignment requirement.
>
> FWIW: it is possible, I suppose, that the other discussion about using an
> opal_process_name_t that exactly mirrors orte_process_name_t could also
> resolve this problem in a cleaner fashion. I didn't impose that requirement
> here, but maybe it's another motivator for doing so?
>
> Ralph
>
>
> On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>  George,
>
> (one of the) faulty line was :
>
>if (ORTE_SUCCESS != (rc =
> opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL,
>
> OPAL_DB_LOCALLDR, (opal_identifier_t*)&proc, OPAL_ID_T))) {
>
> so if proc is not 64 bits aligned, a SIGBUS will occur on sparc.
> as you pointed, replacing OPAL_ID_T with ORTE_NAME will very likely fix
> the issue (i have no arch to test...)
>
> i was initially also "confused" with the following line
>
> if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc,
> OPAL_SCOPE_INTERNAL,
> ORTE_DB_NPROC_OFFSET,
> &offset, OPAL_UINT32))) {
>
> the first argument of store is an (opal_identifier_t *)
> strictly speaking this is "a pointer to a 64 bits aligned address", and
> proc might not be 64 bits aligned.
> /* that being said, there is no crash :-) */
>
> in this case, opal_db.store pointer points to the store function
> (db_hash.c:178)
> and proc is only used id memcpy at line 194, so 64 bits alignment is not
> required.
> (and comment is explicit : /* to protect alignment, copy the data across
> */
>
> that might sounds pedantic, but are we doing the right thing here ?
> (e.g. cast to (opal_identifier_t *), followed by a memcpy  in case the
> pointer was not 64 bits aligned
> vs always use aligned data ?)
>
> Cheers,
>
> Gilles
>
> On 2014/08/08 14:58, George Bosilca wrote:
>
> This is a gigantic patch for an almost trivial issue. The current problem
> is purely related to the fact that in a single location (nidmap.c) the
> orte_process_name_t (which is a structure of 2 integers) is supposed to be
> aligned based on the uint64_t requirements. Bad assumption!
>
> Looking at the code one might notice that the orte_process_name_t is stored
> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold
> on the SPARC architecture because the two types (int32_t and int64_t) have
> different alignments.  However, ORTE define a type for orte_process_name_t.
> Thus, I think that if instead of saving the orte_process_name_t as an
> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
>
>   George.
>
>
>
> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet 
>  wrote:
>
>
>  Kawashima-san and all,
>
> Here is attached a one off patch for v1.8.
> /* it does not use the __attribute__ modifier that might not be
> supported by all compilers */
>
> as far as i am concerned, the same issue is also in the trunk,
> and if you do not hit it, it just means you are lucky :-)
>
> the same issue might also be in other parts of the code :-(
>
> Cheers,
>
> Gilles
>
> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
>
>  Gilles, George,
>
> The problem is the one Gilles pointed.
> I temporarily modified the code bellow and the bus error disappeared.
>
> --- orte/util/nidmap.c  (revision 32447

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Ralph Castain
Yes, I know - but the problem comes from nidmap pushing data down into the 
opal_db/dstore level, which then creates a copy of the data. That's where the 
alignment error is generated.


On Aug 8, 2014, at 11:17 AM, George Bosilca  wrote:

> On Fri, Aug 8, 2014 at 5:21 AM, Ralph Castain  wrote:
> Sorry to chime in a little late. George is likely correct about using 
> ORTE_NAME, only you can't do that as the OPAL layer has no idea what that 
> datatype looks like. This was the original reason for creating the 
> opal_identifier_t type - I had no other choice when we moved the db framework 
> (now dstore) to the OPAL layer in anticipation of the BTLs moving to OPAL. 
> The abstraction requirement wouldn't allow me to pass down the structure 
> definition.
> 
> We are talking about nidmap.c which has not yet been moved down to OPAL. 
> 
>   George.
>  
> 
> The easiest solution is probably to change the opal/db/hash code so that 
> 64-bit fields are memcpy'd instead of simply passed by "=". This should 
> eliminate the problem with the least fuss.
> 
> There is a performance penalty for using non-aligned data, and ideally we 
> should use aligned data whenever possible. This code isn't in the critical 
> path and so this is less of an issue, but still would be nice to do. However, 
> I didn't do so for the following reasons:
> 
> * I couldn't find a way for the compiler to check/require alignment down in 
> opal_db.store when passed a parameter. If someone knows of a way to do that, 
> please feel free to suggest it
> 
> * none of our current developers have access to a Solaris SPARC machine, and 
> thus our developers cannot detect violations when they occur
> 
> * the current solution avoids the issue, albeit with a slight performance 
> penalty
> 
> I'm open to alternative methods - I'm not happy with the ugliness this 
> required, but couldn't come up with a cleaner solution that would be easy for 
> developers to know when they violated the alignment requirement.
> 
> FWIW: it is possible, I suppose, that the other discussion about using an 
> opal_process_name_t that exactly mirrors orte_process_name_t could also 
> resolve this problem in a cleaner fashion. I didn't impose that requirement 
> here, but maybe it's another motivator for doing so?
> 
> Ralph
> 
> 
> On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet 
>  wrote:
> 
>> George,
>> 
>> (one of the) faulty line was :
>> 
>>if (ORTE_SUCCESS != (rc = 
>> opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL,
>> 
>> OPAL_DB_LOCALLDR, (opal_identifier_t*)&proc, OPAL_ID_T))) {
>> 
>> so if proc is not 64 bits aligned, a SIGBUS will occur on sparc.
>> as you pointed, replacing OPAL_ID_T with ORTE_NAME will very likely fix the 
>> issue (i have no arch to test...)
>> 
>> i was initially also "confused" with the following line
>> 
>> if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, 
>> OPAL_SCOPE_INTERNAL,
>> ORTE_DB_NPROC_OFFSET, 
>> &offset, OPAL_UINT32))) {
>> 
>> the first argument of store is an (opal_identifier_t *)
>> strictly speaking this is "a pointer to a 64 bits aligned address", and proc 
>> might not be 64 bits aligned.
>> /* that being said, there is no crash :-) */
>> 
>> in this case, opal_db.store pointer points to the store function 
>> (db_hash.c:178)
>> and proc is only used id memcpy at line 194, so 64 bits alignment is not 
>> required.
>> (and comment is explicit : /* to protect alignment, copy the data across */
>> 
>> that might sounds pedantic, but are we doing the right thing here ?
>> (e.g. cast to (opal_identifier_t *), followed by a memcpy  in case the 
>> pointer was not 64 bits aligned
>> vs always use aligned data ?)
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/08/08 14:58, George Bosilca wrote:
>>> This is a gigantic patch for an almost trivial issue. The current problem
>>> is purely related to the fact that in a single location (nidmap.c) the
>>> orte_process_name_t (which is a structure of 2 integers) is supposed to be
>>> aligned based on the uint64_t requirements. Bad assumption!
>>> 
>>> Looking at the code one might notice that the orte_process_name_t is stored
>>> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold
>>> on the SPARC architecture because the two types (int32_t and int64_t) have
>>> different alignments.  However, ORTE define a type for orte_process_name_t.
>>> Thus, I think that if instead of saving the orte_process_name_t as an
>>> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
>>> 
>>>   George.
>>> 
>>> 
>>> 
>>> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
>>> gilles.gouaillar...@iferc.org> wrote:
>>> 
>>>> Kawashima-san and all,
>>>>
>>>> Here is attached a one off patch for v1.8.
>>>> /* it does not use the __attribute__ modifier that might not be
>>>> supported by all compi