Re: [OMPI devel] 1.10.0rc2

2015-07-23 Thread Burette, Yohann
Paul,

While looking at the issue, we noticed that we were missing some code that 
deals with MTL priorities.

PR 409 (https://github.com/open-mpi/ompi-release/pull/409) is attempting to fix 
that.

Hopefully, this will also fix the error you encountered.

Thanks again,
Yohann
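
[Editor's note: a rough sketch of the kind of priority-based MTL selection at issue, with illustrative names only; this is NOT the actual PR 409 code, just a model of the technique the fix concerns.]

#include <stddef.h>

/* Illustrative sketch only. Each MTL component reports a priority; the
 * framework keeps the module from the highest-priority component that
 * initializes successfully. */
typedef struct {
    const char *name;
    int priority;                /* illustrative; e.g. set via an MCA parameter */
    void *(*init)(void);         /* returns a module, or NULL on failure */
} mtl_component_t;

static void *select_best_mtl(mtl_component_t *comps, int n)
{
    void *best = NULL;
    int best_pri = -1;

    for (int i = 0; i < n; i++) {
        void *module = comps[i].init();
        if (NULL == module) {
            continue;            /* component cannot run on this system */
        }
        if (comps[i].priority > best_pri) {
            best_pri = comps[i].priority;
            best = module;       /* a real implementation must finalize the
                                  * previously winning module here */
        }
        /* Losing modules must also be finalized before their component is
         * unloaded -- see Ralph's note later in this thread. */
    }
    return best;
}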

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Hargrove
Sent: Wednesday, July 22, 2015 12:07 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] 1.10.0rc2

Yohann,

Things run fine with those additional flags.
In fact, adding just "--mca pml cm" is sufficient to eliminate the SEGV.

-Paul

On Wed, Jul 22, 2015 at 8:49 AM, Burette, Yohann <yohann.bure...@intel.com> wrote:
Hi Paul,

Thank you for doing all this testing!

About 1), it’s hard for me to see whether it’s a problem with mtl:ofi or with 
how OMPI selects the components to use.
Could you please run your test again with “--mca mtl ofi --mca mtl_ofi_provider 
sockets --mca pml cm”?
The idea is that if it still fails, then we have a problem with either mtl:ofi 
or the OFI/sockets provider. If it works, then there is an issue with how OMPI 
selects what component to use.

I just tried 1.10.0rc2 with the latest libfabric (master) and it seems to work 
fine.

Yohann

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Hargrove
Sent: Wednesday, July 22, 2015 1:05 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] 1.10.0rc2

1.10.0rc2 looks mostly good to me, but I still found some issues.


1) New to this round of testing, I have built mtl:ofi with gcc, pgi, icc, 
clang, open64 and studio compilers.
I have only the sockets provider in libfabric (v1.0.0 and 1.1.0rc2).
However, unless I pass "-mca mtl ^ofi" to mpirun I get a SEGV from a callback 
invoked in opal_progress().
Gdb did not give a function name for the callback, but the PC looks valid.


2) Of the several compilers I tried, only pgi-13.10 failed to compile mtl:ofi:

/bin/sh ../../../../libtool --tag=CC --mode=compile pgcc -DHAVE_CONFIG_H -I. -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../oshmem/include -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen -I/usr/common/ftg/libfabric/1.1.0rc2p1/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2 -I../../../.. -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/orte/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/oshmem/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/hwloc/hwloc191/hwloc/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/hwloc/hwloc191/hwloc/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/event/libevent2021/libevent/include -g -c -o mtl_ofi_component.lo /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c
libtool: compile: pgcc -DHAVE_CONFIG_H -I. -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../oshmem/include -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen -I/usr/common/ftg/libfabric/1.1.0rc2p1/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2 -I../../../.. -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/orte/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/oshmem/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-lin

Re: [OMPI devel] 1.10.0rc2

2015-07-23 Thread Paul Hargrove
Yohann,

With PR409 as it stands right now (commit 6daef310) I see no change to the
behavior.
I still get a SEGV below opal_progress() unless I use either
   -mca mtl ^ofi
OR
   -mca pml cm

A backtrace from gdb appears below.

-Paul

(gdb) where
#0  0x7f5bc7b59867 in ?? () from /lib64/libgcc_s.so.1
#1  0x7f5bc7b5a119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2  0x7f5bcc9b08f6 in __backtrace (array=, size=32)
at ../sysdeps/ia64/backtrace.c:110
#3  0x7f5bcc3483e1 in opal_backtrace_print (file=0x7f5bccc40880,
prefix=0x7fff6181d1f0 "[pcp-f-5:05049] ", strip=2)
at
/scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/mca/backtrace/execinfo/backtrace_execinfo.c:47
#4  0x7f5bcc3456a9 in show_stackframe (signo=11, info=0x7fff6181d770,
p=0x7fff6181d640)
at
/scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/util/stacktrace.c:336
#5  <signal handler called>
#6  0x7f5bc7717c58 in ?? ()
#7  0x7f5bcc2f567a in opal_progress ()
at
/scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/runtime/opal_progress.c:187
#8  0x7f5bccebbcb9 in ompi_mpi_init (argc=1, argv=0x7fff6181dd78,
requested=0, provided=0x7fff6181dbf8)
at
/scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/ompi/runtime/ompi_mpi_init.c:645
#9  0x7f5bccefbe77 in PMPI_Init (argc=0x7fff6181dc5c,
argv=0x7fff6181dc50) at pinit.c:84
#10 0x0040088e in main (argc=1, argv=0x7fff6181dd78) at ring_c.c:19

(gdb) up 6
#6  0x7f5bc7717c58 in ?? ()
(gdb) disass
No function contains program counter for selected frame.
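
[Editor's note: the "?? ()" frame is what gdb prints when opal_progress() calls a function pointer whose code has been unloaded. Below is a minimal model of a progress-callback table; this is a simplification for illustration, not the actual contents of opal/runtime/opal_progress.c.]

#include <stddef.h>

/* Simplified model of a progress-callback table. */
typedef int (*progress_callback_t)(void);

static progress_callback_t callbacks[8];
static size_t num_callbacks = 0;

int progress_register(progress_callback_t cb)
{
    if (num_callbacks >= 8) {
        return -1;               /* table full */
    }
    callbacks[num_callbacks++] = cb;
    return 0;
}

int progress(void)
{
    int events = 0;
    /* If a component registered a callback and was later dlclose()d without
     * unregistering, callbacks[i] points into unmapped text. Calling it
     * faults at a PC that gdb cannot map to any symbol, exactly like
     * frame #6 ("?? ()") in the backtrace above. */
    for (size_t i = 0; i < num_callbacks; i++) {
        events += callbacks[i]();
    }
    return events;
}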

On Thu, Jul 23, 2015 at 8:13 AM, Burette, Yohann <yohann.bure...@intel.com> wrote:

> Paul,
>
> While looking at the issue, we noticed that we were missing some code that
> deals with MTL priorities.
>
> PR 409 (https://github.com/open-mpi/ompi-release/pull/409) is attempting
> to fix that.
>
> Hopefully, this will also fix the error you encountered.
>
> Thanks again,
>
> Yohann

Re: [OMPI devel] 1.10.0rc2

2015-07-23 Thread Ralph Castain
It looks like one of the MTL components is registering a progress callback with opal_progress, and then unloading when it is de-selected. Registering with opal_progress should only be done once the MTL has been selected and will actually run.
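
[Editor's note: in code terms, the safe lifecycle looks roughly like the sketch below. opal_progress_register() and opal_progress_unregister() are real OPAL entry points from opal/runtime/opal_progress.h; the surrounding function and callback names are hypothetical, for illustration only.]

#include "opal/runtime/opal_progress.h"

/* Hypothetical mtl:ofi progress callback, for illustration. */
extern int ompi_mtl_ofi_progress_cb(void);

/* Register only after this MTL has won selection and will run... */
int mtl_ofi_on_selected(void)
{
    return opal_progress_register(ompi_mtl_ofi_progress_cb);
}

/* ...and always unregister before the component can be unloaded, so
 * opal_progress() never holds a pointer into dlclose()d code. */
int mtl_ofi_finalize(void)
{
    return opal_progress_unregister(ompi_mtl_ofi_progress_cb);
}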


> On Jul 23, 2015, at 5:05 PM, Paul Hargrove wrote:
> 
> Yohann,
> 
> With PR409 as it stands right now (commit 6daef310) I see no change to the
> behavior.
> I still get a SEGV below opal_progress() unless I use either
>    -mca mtl ^ofi
> OR
>    -mca pml cm
> 
> A backtrace from gdb appears below.
> 
> -Paul