Howard,

Not sure if the "--mca mtl_base_verbose 10" output is still needed, but
I've attached it in case it is.

-Paul

On Fri, Jul 24, 2015 at 7:26 AM, Howard Pritchard <hpprit...@gmail.com>
wrote:

> Paul,
>
> Could you rerun with --mca mtl_base_verbose 10 added to the command line
> and send the output?
>
> Howard
>
> ----------
>
> Sent from my smart phone, so please excuse the typos.
>
> Howard
> On Jul 23, 2015 6:06 PM, "Paul Hargrove" <phhargr...@lbl.gov> wrote:
>
>> Yohann,
>>
>> With PR409 as it stands right now (commit 6daef310) I see no change to
>> the behavior.
>> I still get a SEGV below opal_progress() unless I use either
>>    -mca mtl ^ofi
>> OR
>>    -mca pml cm
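>>
>> For reference, the workaround invocations look roughly like the following
>> (the test program and rank count here are just placeholders for my actual
>> runs):
>>
>>    mpirun -np 2 -mca mtl ^ofi ./ring_c     # exclude the OFI MTL entirely
>>    mpirun -np 2 -mca pml cm   ./ring_c     # force the cm PML
>>
>> Either one avoids the crash; leaving both flags off reproduces it.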
>>
>> A backtrace from gdb appears below.
>>
>> -Paul
>>
>> (gdb) where
>> #0  0x00007f5bc7b59867 in ?? () from /lib64/libgcc_s.so.1
>> #1  0x00007f5bc7b5a119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
>> #2  0x00007f5bcc9b08f6 in __backtrace (array=<value optimized out>, size=32)
>>     at ../sysdeps/ia64/backtrace.c:110
>> #3  0x00007f5bcc3483e1 in opal_backtrace_print (file=0x7f5bccc40880, prefix=0x7fff6181d1f0 "[pcp-f-5:05049] ", strip=2)
>>     at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/mca/backtrace/execinfo/backtrace_execinfo.c:47
>> #4  0x00007f5bcc3456a9 in show_stackframe (signo=11, info=0x7fff6181d770, p=0x7fff6181d640)
>>     at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/util/stacktrace.c:336
>> #5  <signal handler called>
>> #6  0x00007f5bc7717c58 in ?? ()
>> #7  0x00007f5bcc2f567a in opal_progress ()
>>     at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/runtime/opal_progress.c:187
>> #8  0x00007f5bccebbcb9 in ompi_mpi_init (argc=1, argv=0x7fff6181dd78, requested=0, provided=0x7fff6181dbf8)
>>     at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/ompi/runtime/ompi_mpi_init.c:645
>> #9  0x00007f5bccefbe77 in PMPI_Init (argc=0x7fff6181dc5c, argv=0x7fff6181dc50) at pinit.c:84
>> #10 0x000000000040088e in main (argc=1, argv=0x7fff6181dd78) at ring_c.c:19
>>
>> (gdb) up 6
>> #6  0x00007f5bc7717c58 in ?? ()
>> (gdb) disass
>> No function contains program counter for selected frame.
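>>
>> If it would be useful, I can also run something like the following in the
>> same session to see which mapping (if any) that PC falls into (commands
>> sketched here, not actual output from this run):
>>
>> (gdb) info symbol 0x7f5bc7717c58
>> (gdb) info sharedlibrary
>>
>> My guess is the address belongs to a component library that has already
>> been dlclose()d, but I haven't confirmed that.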
>>
>> On Thu, Jul 23, 2015 at 8:13 AM, Burette, Yohann <
>> yohann.bure...@intel.com> wrote:
>>
>>>  Paul,
>>>
>>>
>>>
>>> While looking at the issue, we noticed that we were missing some code
>>> that deals with MTL priorities.
>>>
>>>
>>>
>>> PR 409 (https://github.com/open-mpi/ompi-release/pull/409) is
>>> attempting to fix that.
>>>
>>>
>>>
>>> Hopefully, this will also fix the error you encountered.
>>>
>>>
>>>
>>> Thanks again,
>>>
>>> Yohann
>>>
>>>
>>>
>>> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Paul
>>> Hargrove
>>> *Sent:* Wednesday, July 22, 2015 12:07 PM
>>>
>>> *To:* Open MPI Developers
>>> *Subject:* Re: [OMPI devel] 1.10.0rc2
>>>
>>>
>>>
>>> Yohann,
>>>
>>>
>>>
>>> Things run fine with those additional flags.
>>>
>>> In fact, adding just "--mca pml cm" is sufficient to eliminate the SEGV.
>>>
>>>
>>>
>>> -Paul
>>>
>>>
>>>
>>> On Wed, Jul 22, 2015 at 8:49 AM, Burette, Yohann <
>>> yohann.bure...@intel.com> wrote:
>>>
>>>  Hi Paul,
>>>
>>>
>>>
>>> Thank you for doing all this testing!
>>>
>>>
>>>
>>> About 1), it’s hard for me to see whether it’s a problem with mtl:ofi or
>>> with how OMPI selects the components to use.
>>>
>>> Could you please run your test again with “--mca mtl ofi --mca
>>> mtl_ofi_provider sockets --mca pml cm”?
>>>
>>> The idea is that if it still fails, then we have a problem with either
>>> mtl:ofi or the OFI/sockets provider. If it works, then there is an issue
>>> with how OMPI selects which component to use.
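>>>
>>> Concretely, something along these lines (substitute your own test binary
>>> and process count):
>>>
>>>    mpirun -np 2 --mca pml cm --mca mtl ofi --mca mtl_ofi_provider sockets ./ring_c
>>>
>>> That pins the cm PML plus the OFI MTL with the sockets provider, so the
>>> selection logic is taken out of the picture.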
>>>
>>>
>>>
>>> I just tried 1.10.0rc2 with the latest libfabric (master) and it seems
>>> to work fine.
>>>
>>>
>>>
>>> Yohann
>>>
>>>
>>>
>>> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Paul
>>> Hargrove
>>> *Sent:* Wednesday, July 22, 2015 1:05 AM
>>> *To:* Open MPI Developers
>>> *Subject:* Re: [OMPI devel] 1.10.0rc2
>>>
>>>
>>>
>>> 1.10.0rc2 looks mostly good to me, but I still found some issues.
>>>
>>>
>>>
>>>
>>>
>>> 1) New to this round of testing, I have built mtl:ofi with gcc, pgi,
>>> icc, clang, open64 and studio compilers.
>>>
>>> I have only the sockets provider in libfabric (v1.0.0 and 1.1.0rc2).
>>>
>>> However, unless I pass "-mca mtl ^ofi" to mpirun I get a SEGV from a
>>> callback invoked in opal_progress().
>>>
>>> GDB did not give a function name for the callback, but the PC looks
>>> valid.
>>>
>>>
>>>
>>>
>>>
>>> 2) Of the several compilers I tried, only pgi-13.10 failed to compile
>>> mtl:ofi:
>>>
>>>
>>>
>>>         /bin/sh ../../../../libtool  --tag=CC   --mode=compile pgcc
>>> -DHAVE_CONFIG_H -I.
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi
>>> -I../../../../opal/include -I../../../../orte/include
>>> -I../../../../ompi/include -I../../../../oshmem/include
>>> -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen
>>> -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen
>>>  -I/usr/common/ftg/libfabric/1.1.0rc2p1/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2
>>> -I../../../..
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/orte/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/oshmem/include
>>>
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/hwloc/hwloc191/hwloc/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/hwloc/hwloc191/hwloc/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/event/libevent2021/libevent/include
>>>  -g  -c -o mtl_ofi_component.lo
>>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c
>>>
>>> libtool: compile:  pgcc -DHAVE_CONFIG_H -I.
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi
>>> -I../../../../opal/include -I../../../../orte/include
>>> -I../../../../ompi/include -I../../../../oshmem/include
>>> -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen
>>> -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen
>>> -I/usr/common/ftg/libfabric/1.1.0rc2p1/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2
>>> -I../../../..
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/orte/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/oshmem/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/hwloc/hwloc191/hwloc/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/hwloc/hwloc191/hwloc/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent/include
>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/event/libevent2021/libevent/include
>>> -g -c
>>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c
>>>  -fpic -DPIC -o .libs/mtl_ofi_component.o
>>>
>>> PGC-S-0060-opal_convertor_clone is not a member of this struct or union
>>> (/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c: 51)
>>>
>>> pgcc-Fatal-/global/scratch2/sd/hargrove/pgi-13.10/linux86-64/13.10/bin/pgc
>>> TERMINATED by signal 11
>>>
>>>
>>>
>>> Since this ends with a SEGV in the compiler, I don't think this is an
>>> issue with the C code, just a plain compiler bug.
>>>
>>> At least pgi-9.0-4 and pgi-10.9 compiled the code just fine.
>>>
>>>
>>>
>>>
>>>
>>> 3) As I noted in a separate email, there are some newly uncovered issues
>>> in the embedded hwloc w/ pgi and -m32.
>>>
>>> However, I had not tested such configurations previously, and all
>>> indications are that these issues have existed for a while.
>>>
>>> Brice is on vacation, so there will not be an official hwloc fix for
>>> this issue until next week at the earliest.
>>>
>>> [The upside is that I now have coverage for eight additional x86
>>> configurations (true x86 or x86-64 w/ -m32).]
>>>
>>>
>>>
>>>
>>>
>>> 4) I noticed a couple of warnings somebody might want to investigate:
>>>
>>>
>>> openmpi-1.10.0rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:2323:59:
>>> warning: format specifies type 'int' but the argument has type 'struct
>>> ibv_qp *' [-Wformat]
>>>
>>>
>>> openmpi-1.10.0rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c",
>>> line 2471: warning: improper pointer/integer combination: arg #3
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Also worth noting:
>>>
>>>
>>>
>>> The ConnectX and ConnectIB XRC detection logic appears to be working as
>>> expected on multiple systems.
>>>
>>>
>>>
>>> I have also learned that pgi-9.0-4 is not a conforming C99 compiler when
>>> passed -m32, which is not Open MPI's fault.
>>>
>>>
>>>
>>>
>>>
>>> And as before...
>>>
>>> + I am currently without any SPARC platforms
>>>
>>> + Several qemu-emulated ARM and MIPS tests will complete by morning
>>> (though I have some ARM successes already)
>>>
>>>
>>>
>>>
>>>
>>> -Paul
>>>
>>>
>>>
>>> On Tue, Jul 21, 2015 at 12:29 PM, Ralph Castain <r...@open-mpi.org>
>>> wrote:
>>>
>>>  Hey folks
>>>
>>>
>>>
>>> 1.10.0rc2 is now out for review - excepting the library version numbers,
>>> this should be the final version. Please take a quick gander and let me
>>> know of any problems.
>>>
>>>
>>>
>>> http://www.open-mpi.org/software/ompi/v1.10/
>>>
>>>
>>>
>>> Ralph
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2015/07/17670.php
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>
>>> Computer Languages & Systems Software (CLaSS) Group
>>>
>>> Computer Science Department               Tel: +1-510-495-2352
>>>
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2015/07/17681.php
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>
>>> Computer Languages & Systems Software (CLaSS) Group
>>>
>>> Computer Science Department               Tel: +1-510-495-2352
>>>
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2015/07/17687.php
>>>
>>
>>
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/07/17688.php
>>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/07/17692.php
>



-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
[pcp-f-5:16668] mca: base: components_register: registering mtl components
[pcp-f-5:16668] mca: base: components_register: found loaded component ofi
[pcp-f-5:16668] mca: base: components_register: component ofi register function successful
[pcp-f-5:16668] mca: base: components_open: opening mtl components
[pcp-f-5:16668] mca: base: components_open: found loaded component ofi
[pcp-f-5:16668] mca: base: components_open: component ofi open function successful
[pcp-f-5:16668] mca:base:select: Auto-selecting mtl components
[pcp-f-5:16668] mca:base:select:(  mtl) Querying component [ofi]
[pcp-f-5:16668] mca:base:select:(  mtl) Query of component [ofi] set priority to 10
[pcp-f-5:16668] mca:base:select:(  mtl) Selected component [ofi]
[pcp-f-5:16668] select: initializing mtl component ofi
[pcp-f-5:16669] mca: base: components_register: registering mtl components
[pcp-f-5:16669] mca: base: components_register: found loaded component ofi
[pcp-f-5:16669] mca: base: components_register: component ofi register function successful
[pcp-f-5:16669] mca: base: components_open: opening mtl components
[pcp-f-5:16669] mca: base: components_open: found loaded component ofi
[pcp-f-5:16669] mca: base: components_open: component ofi open function successful
[pcp-f-5:16669] mca:base:select: Auto-selecting mtl components
[pcp-f-5:16669] mca:base:select:(  mtl) Querying component [ofi]
[pcp-f-5:16669] mca:base:select:(  mtl) Query of component [ofi] set priority to 10
[pcp-f-5:16669] mca:base:select:(  mtl) Selected component [ofi]
[pcp-f-5:16669] select: initializing mtl component ofi
[pcp-f-5:16668] select: init returned success
[pcp-f-5:16668] select: component ofi selected
[pcp-f-5:16668] mca: base: close: component ofi closed
[pcp-f-5:16668] mca: base: close: unloading component ofi
[pcp-f-5:16668] *** Process received signal ***
[pcp-f-5:16668] Signal: Segmentation fault (11)
[pcp-f-5:16668] Signal code: Address not mapped (1)
[pcp-f-5:16668] Failing at address: 0x7fd3a7b06c58
[pcp-f-5:16669] select: init returned success
[pcp-f-5:16669] select: component ofi selected
[pcp-f-5:16669] mca: base: close: component ofi closed
[pcp-f-5:16669] mca: base: close: unloading component ofi
[pcp-f-5:16669] *** Process received signal ***
[pcp-f-5:16669] Signal: Segmentation fault (11)
[pcp-f-5:16669] Signal code: Address not mapped (1)
[pcp-f-5:16669] Failing at address: 0x7f99fc4c7c58
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 16669 on node pcp-f-5 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
