Howard,

Not sure if the "--mca mtl_base_verbose 10" output is still needed, but I've attached it in case it is.
-Paul

On Fri, Jul 24, 2015 at 7:26 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
> Paul
>
> Could you rerun with --mca mtl_base_verbose 10 added to cmd line and send output?
>
> Howard
>
> ----------
>
> sent from my smart phone so no good type.
>
> Howard
>
> On Jul 23, 2015 6:06 PM, "Paul Hargrove" <phhargr...@lbl.gov> wrote:
>
>> Yohann,
>>
>> With PR409 as it stands right now (commit 6daef310) I see no change to the behavior.
>> I still get a SEGV below opal_progress() unless I use either
>>    -mca mtl ^ofi
>> OR
>>    -mca pml cm
>>
>> A backtrace from gdb appears below.
>>
>> -Paul
>>
>> (gdb) where
>> #0  0x00007f5bc7b59867 in ?? () from /lib64/libgcc_s.so.1
>> #1  0x00007f5bc7b5a119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
>> #2  0x00007f5bcc9b08f6 in __backtrace (array=<value optimized out>, size=32) at ../sysdeps/ia64/backtrace.c:110
>> #3  0x00007f5bcc3483e1 in opal_backtrace_print (file=0x7f5bccc40880, prefix=0x7fff6181d1f0 "[pcp-f-5:05049] ", strip=2) at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/mca/backtrace/execinfo/backtrace_execinfo.c:47
>> #4  0x00007f5bcc3456a9 in show_stackframe (signo=11, info=0x7fff6181d770, p=0x7fff6181d640) at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/util/stacktrace.c:336
>> #5  <signal handler called>
>> #6  0x00007f5bc7717c58 in ?? ()
>> #7  0x00007f5bcc2f567a in opal_progress () at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/runtime/opal_progress.c:187
>> #8  0x00007f5bccebbcb9 in ompi_mpi_init (argc=1, argv=0x7fff6181dd78, requested=0, provided=0x7fff6181dbf8) at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/ompi/runtime/ompi_mpi_init.c:645
>> #9  0x00007f5bccefbe77 in PMPI_Init (argc=0x7fff6181dc5c, argv=0x7fff6181dc50) at pinit.c:84
>> #10 0x000000000040088e in main (argc=1, argv=0x7fff6181dd78) at ring_c.c:19
>>
>> (gdb) up 6
>> #6  0x00007f5bc7717c58 in ?? ()
>> (gdb) disass
>> No function contains program counter for selected frame.
>>
>> On Thu, Jul 23, 2015 at 8:13 AM, Burette, Yohann <yohann.bure...@intel.com> wrote:
>>
>>> Paul,
>>>
>>> While looking at the issue, we noticed that we were missing some code that deals with MTL priorities.
>>>
>>> PR 409 (https://github.com/open-mpi/ompi-release/pull/409) is attempting to fix that.
>>>
>>> Hopefully, this will also fix the error you encountered.
>>>
>>> Thanks again,
>>> Yohann
>>>
>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Hargrove
>>> Sent: Wednesday, July 22, 2015 12:07 PM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] 1.10.0rc2
>>>
>>> Yohann,
>>>
>>> Things run fine with those additional flags.
>>> In fact, adding just "--mca pml cm" is sufficient to eliminate the SEGV.
>>>
>>> -Paul
>>>
>>> On Wed, Jul 22, 2015 at 8:49 AM, Burette, Yohann <yohann.bure...@intel.com> wrote:
>>>
>>> Hi Paul,
>>>
>>> Thank you for doing all this testing!
>>>
>>> About 1), it's hard for me to see whether it's a problem with mtl:ofi or with how OMPI selects the components to use.
>>> Could you please run your test again with "--mca mtl ofi --mca mtl_ofi_provider sockets --mca pml cm"?
>>> The idea is that if it still fails, then we have a problem with either mtl:ofi or the OFI/sockets provider. If it works, then there is an issue with how OMPI selects which component to use.
>>>
>>> I just tried 1.10.0rc2 with the latest libfabric (master) and it seems to work fine.
>>>
>>> Yohann
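[Spelled out as a full command line, the diagnostic run Yohann is suggesting here would presumably look something like the sketch below; "ring_c" (the test program named in the backtrace above) and the two-rank count are only placeholders, not taken from the thread:

    mpirun -np 2 --mca mtl ofi --mca mtl_ofi_provider sockets --mca pml cm ./ring_c
]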
>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Hargrove
>>> Sent: Wednesday, July 22, 2015 1:05 AM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] 1.10.0rc2
>>>
>>> 1.10.0rc2 looks mostly good to me, but I still found some issues.
>>>
>>> 1) New to this round of testing, I have built mtl:ofi with gcc, pgi, icc, clang, open64 and studio compilers.
>>> I have only the sockets provider in libfabric (v1.0.0 and 1.1.0rc2).
>>> However, unless I pass "-mca mtl ^ofi" to mpirun I get a SEGV from a callback invoked in opal_progress().
>>> Gdb did not give a function name for the callback, but the PC looks valid.
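[For concreteness, a sketch of the failing run and the two workarounds Paul reports; again, "ring_c" and the two-rank count are placeholder assumptions:

    mpirun -np 2 ./ring_c                   # SEGV from a callback invoked in opal_progress()
    mpirun -np 2 -mca mtl ^ofi ./ring_c     # reported workaround: exclude the OFI MTL
    mpirun -np 2 -mca pml cm ./ring_c       # reported workaround: force the cm PML
]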
>>>
>>> 2) Of the several compilers I tried, only pgi-13.10 failed to compile mtl:ofi:
>>>
>>> /bin/sh ../../../../libtool --tag=CC --mode=compile pgcc -DHAVE_CONFIG_H -I. -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../oshmem/include -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen -I/usr/common/ftg/libfabric/1.1.0rc2p1/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2 -I../../../.. -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/orte/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/oshmem/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/hwloc/hwloc191/hwloc/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/hwloc/hwloc191/hwloc/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/event/libevent2021/libevent/include -g -c -o mtl_ofi_component.lo /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c
>>>
>>> libtool: compile: pgcc -DHAVE_CONFIG_H -I. -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../oshmem/include -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen -I/usr/common/ftg/libfabric/1.1.0rc2p1/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2 -I../../../.. -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/orte/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/oshmem/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/hwloc/hwloc191/hwloc/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/hwloc/hwloc191/hwloc/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent/include -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/event/libevent2021/libevent/include -g -c /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c -fpic -DPIC -o .libs/mtl_ofi_component.o
>>>
>>> PGC-S-0060-opal_convertor_clone is not a member of this struct or union (/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c: 51)
>>> pgcc-Fatal-/global/scratch2/sd/hargrove/pgi-13.10/linux86-64/13.10/bin/pgc TERMINATED by signal 11
>>>
>>> Since this ends with a SEGV in the compiler, I don't think this is an issue with the C code, just a plain compiler bug.
>>> At least pgi-9.0-4 and pgi-10.9 compiled the code just fine.
>>>
>>> 3) As I noted in a separate email, there are some newly uncovered issues in the embedded hwloc w/ pgi and -m32.
>>> However, I had not tested such configurations previously, and all indications are that these issues have existed for a while.
>>> Brice is on vacation, so there will not be an official hwloc fix for this issue until next week at the earliest.
>>> [The upside is that I now have coverage for eight additional x86 configurations (true x86 or x86-64 w/ -m32).]
>>>
>>> 4) I noticed a couple warnings somebody might want to investigate:
>>>
>>> openmpi-1.10.0rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:2323:59: warning: format specifies type 'int' but the argument has type 'struct ibv_qp *' [-Wformat]
>>> "openmpi-1.10.0rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c", line 2471: warning: improper pointer/integer combination: arg #3
>>>
>>> Also worth noting:
>>>
>>> The ConnectX and ConnectIB XRC detection logic appears to be working as expected on multiple systems.
>>>
>>> I also have learned that pgi-9.0-4 is not a conforming C99 compiler when passed -m32, which is not Open MPI's fault.
>>>
>>> And as before...
>>> + I am currently without any SPARC platforms
>>> + Several qemu-emulated ARM and MIPS tests will complete by morning (though I have some ARM successes already)
>>>
>>> -Paul
>>>
>>> On Tue, Jul 21, 2015 at 12:29 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Hey folks
>>>
>>> 1.10.0rc2 is now out for review - excepting the library version numbers, this should be the final version. Please take a quick gander and let me know of any problems.
>>>
>>> http://www.open-mpi.org/software/ompi/v1.10/
>>>
>>> Ralph
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/07/17670.php
>>>
>>> --
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department               Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/07/17681.php
>>>
>>> --
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department               Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/07/17687.php
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/07/17688.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/07/17692.php
--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
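[The attached "--mca mtl_base_verbose 10" output follows below. The exact command line is not shown in the thread; given the two ranks in the output and the ring_c test from the earlier backtrace, it would presumably have been launched along the lines of

    mpirun -np 2 --mca mtl_base_verbose 10 ./ring_c
]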
[pcp-f-5:16668] mca: base: components_register: registering mtl components
[pcp-f-5:16668] mca: base: components_register: found loaded component ofi
[pcp-f-5:16668] mca: base: components_register: component ofi register function successful
[pcp-f-5:16668] mca: base: components_open: opening mtl components
[pcp-f-5:16668] mca: base: components_open: found loaded component ofi
[pcp-f-5:16668] mca: base: components_open: component ofi open function successful
[pcp-f-5:16668] mca:base:select: Auto-selecting mtl components
[pcp-f-5:16668] mca:base:select:( mtl) Querying component [ofi]
[pcp-f-5:16668] mca:base:select:( mtl) Query of component [ofi] set priority to 10
[pcp-f-5:16668] mca:base:select:( mtl) Selected component [ofi]
[pcp-f-5:16668] select: initializing mtl component ofi
[pcp-f-5:16669] mca: base: components_register: registering mtl components
[pcp-f-5:16669] mca: base: components_register: found loaded component ofi
[pcp-f-5:16669] mca: base: components_register: component ofi register function successful
[pcp-f-5:16669] mca: base: components_open: opening mtl components
[pcp-f-5:16669] mca: base: components_open: found loaded component ofi
[pcp-f-5:16669] mca: base: components_open: component ofi open function successful
[pcp-f-5:16669] mca:base:select: Auto-selecting mtl components
[pcp-f-5:16669] mca:base:select:( mtl) Querying component [ofi]
[pcp-f-5:16669] mca:base:select:( mtl) Query of component [ofi] set priority to 10
[pcp-f-5:16669] mca:base:select:( mtl) Selected component [ofi]
[pcp-f-5:16669] select: initializing mtl component ofi
[pcp-f-5:16668] select: init returned success
[pcp-f-5:16668] select: component ofi selected
[pcp-f-5:16668] mca: base: close: component ofi closed
[pcp-f-5:16668] mca: base: close: unloading component ofi
[pcp-f-5:16668] *** Process received signal ***
[pcp-f-5:16668] Signal: Segmentation fault (11)
[pcp-f-5:16668] Signal code: Address not mapped (1)
[pcp-f-5:16668] Failing at address: 0x7fd3a7b06c58
[pcp-f-5:16669] select: init returned success
[pcp-f-5:16669] select: component ofi selected
[pcp-f-5:16669] mca: base: close: component ofi closed
[pcp-f-5:16669] mca: base: close: unloading component ofi
[pcp-f-5:16669] *** Process received signal ***
[pcp-f-5:16669] Signal: Segmentation fault (11)
[pcp-f-5:16669] Signal code: Address not mapped (1)
[pcp-f-5:16669] Failing at address: 0x7f99fc4c7c58
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 16669 on node pcp-f-5 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------