Glancing at the code, I believe I see the problem. The OFI MTL component registers an opal_progress callback during component init, but the CM PML is not the one ultimately selected. Thus, the CM PML has its finalize called and is unloaded.

During finalize, CM closes the MTL framework, which in turn calls component close on all of the MTL components, including OFI. However, the OFI component close function does *not* unregister the opal_progress callback, and so we segfault on the first call to opal_progress(). Just add the unregister call to the OFI component close and you should be okay.
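Roughly, the missing piece is a one-line teardown in the component close function. This is only a sketch; the symbol names (ompi_mtl_ofi_progress, ompi_mtl_ofi_component_close) are my assumptions based on the usual mtl/ofi naming, so match them against whatever mtl_ofi_component.c actually registers during init:

    /* Sketch of the suggested fix.  ompi_mtl_ofi_progress and
     * ompi_mtl_ofi_component_close are assumed names -- use whatever the
     * component's init function actually passed to opal_progress_register(). */
    #include "opal/runtime/opal_progress.h"
    #include "ompi/constants.h"

    extern int ompi_mtl_ofi_progress(void);   /* callback registered during component init */

    static int ompi_mtl_ofi_component_close(void)
    {
        /* Undo the opal_progress_register() done in component init, so that
         * opal_progress() never calls into a component that has been unloaded. */
        opal_progress_unregister(ompi_mtl_ofi_progress);
        return OMPI_SUCCESS;
    }

That keeps the callback's lifetime bounded by the component's lifetime, so opal_progress() can never jump through a pointer into unloaded code.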
> On Jul 24, 2015, at 7:19 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> Yohann --
>
> Can you have a look?
>
>
>> On Jul 24, 2015, at 10:15 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>
>> Looks like the ofi mtl is being naughty. It's the only mtl which registers with opal progress in its component init method.
>>
>> ----------
>>
>> Sent from my smart phone, so no good typing.
>>
>> Howard
>>
>> On Jul 23, 2015 7:03 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>> It looks like one of the MTL components is registering a progress call with the opal_progress thread, and then unloading when de-selected. Registering with opal_progress should only be done once the MTL has been selected and will run.
>>
>>
>>> On Jul 23, 2015, at 5:05 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>> Yohann,
>>>
>>> With PR409 as it stands right now (commit 6daef310) I see no change to the behavior.
>>> I still get a SEGV below opal_progress() unless I use either
>>>     -mca mtl ^ofi
>>> OR
>>>     -mca pml cm
>>>
>>> A backtrace from gdb appears below.
>>>
>>> -Paul
>>>
>>> (gdb) where
>>> #0  0x00007f5bc7b59867 in ?? () from /lib64/libgcc_s.so.1
>>> #1  0x00007f5bc7b5a119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
>>> #2  0x00007f5bcc9b08f6 in __backtrace (array=<value optimized out>, size=32) at ../sysdeps/ia64/backtrace.c:110
>>> #3  0x00007f5bcc3483e1 in opal_backtrace_print (file=0x7f5bccc40880, prefix=0x7fff6181d1f0 "[pcp-f-5:05049] ", strip=2) at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/mca/backtrace/execinfo/backtrace_execinfo.c:47
>>> #4  0x00007f5bcc3456a9 in show_stackframe (signo=11, info=0x7fff6181d770, p=0x7fff6181d640) at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/util/stacktrace.c:336
>>> #5  <signal handler called>
>>> #6  0x00007f5bc7717c58 in ?? ()
>>> #7  0x00007f5bcc2f567a in opal_progress () at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/runtime/opal_progress.c:187
>>> #8  0x00007f5bccebbcb9 in ompi_mpi_init (argc=1, argv=0x7fff6181dd78, requested=0, provided=0x7fff6181dbf8) at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/ompi/runtime/ompi_mpi_init.c:645
>>> #9  0x00007f5bccefbe77 in PMPI_Init (argc=0x7fff6181dc5c, argv=0x7fff6181dc50) at pinit.c:84
>>> #10 0x000000000040088e in main (argc=1, argv=0x7fff6181dd78) at ring_c.c:19
>>>
>>> (gdb) up 6
>>> #6  0x00007f5bc7717c58 in ?? ()
>>> (gdb) disass
>>> No function contains program counter for selected frame.
>>>
>>> On Thu, Jul 23, 2015 at 8:13 AM, Burette, Yohann <yohann.bure...@intel.com> wrote:
>>> Paul,
>>>
>>> While looking at the issue, we noticed that we were missing some code that deals with MTL priorities.
>>>
>>> PR 409 (https://github.com/open-mpi/ompi-release/pull/409) is attempting to fix that.
>>>
>>> Hopefully, this will also fix the error you encountered.
>>>
>>> Thanks again,
>>> Yohann
>>>
>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Hargrove
>>> Sent: Wednesday, July 22, 2015 12:07 PM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] 1.10.0rc2
>>>
>>> Yohann,
>>>
>>> Things run fine with those additional flags.
>>> In fact, adding just "--mca pml cm" is sufficient to eliminate the SEGV.
>>>
>>> -Paul
>>>
>>> On Wed, Jul 22, 2015 at 8:49 AM, Burette, Yohann <yohann.bure...@intel.com> wrote:
>>> Hi Paul,
>>>
>>> Thank you for doing all this testing!
>>>
>>> About 1), it's hard for me to see whether it's a problem with mtl:ofi or with how OMPI selects the components to use.
>>> Could you please run your test again with "--mca mtl ofi --mca mtl_ofi_provider sockets --mca pml cm"?
>>> The idea is that if it still fails, then we have a problem with either mtl:ofi or the OFI/sockets provider. If it works, then there is an issue with how OMPI selects what component to use.
>>>
>>> I just tried 1.10.0rc2 with the latest libfabric (master) and it seems to work fine.
>>>
>>> Yohann
>>>
>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Hargrove
>>> Sent: Wednesday, July 22, 2015 1:05 AM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] 1.10.0rc2
>>>
>>> 1.10.0rc2 looks mostly good to me, but I still found some issues.
>>>
>>> 1) New to this round of testing, I have built mtl:ofi with gcc, pgi, icc, clang, open64 and studio compilers.
>>> I have only the sockets provider in libfabric (v1.0.0 and 1.1.0rc2).
>>> However, unless I pass "-mca mtl ^ofi" to mpirun I get a SEGV from a callback invoked in opal_progress().
>>> Gdb did not give a function name for the callback, but the PC looks valid.
>>>
>>> 2) Of the several compilers I tried, only pgi-13.10 failed to compile mtl:ofi:
>>>
>>> /bin/sh ../../../../libtool --tag=CC --mode=compile pgcc -DHAVE_CONFIG_H -I.
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi
>>>   -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../oshmem/include
>>>   -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen
>>>   -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen
>>>   -I/usr/common/ftg/libfabric/1.1.0rc2p1/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2
>>>   -I../../../..
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/orte/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/oshmem/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/hwloc/hwloc191/hwloc/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/hwloc/hwloc191/hwloc/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/event/libevent2021/libevent/include
>>>   -g -c -o mtl_ofi_component.lo /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c
>>>
>>> libtool: compile: pgcc -DHAVE_CONFIG_H -I.
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi
>>>   -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../oshmem/include
>>>   -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen
>>>   -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen
>>>   -I/usr/common/ftg/libfabric/1.1.0rc2p1/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2
>>>   -I../../../..
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/orte/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/oshmem/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/hwloc/hwloc191/hwloc/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/hwloc/hwloc191/hwloc/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent/include
>>>   -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/event/libevent2021/libevent/include
>>>   -g -c /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c -fpic -DPIC -o .libs/mtl_ofi_component.o
>>>
>>> PGC-S-0060-opal_convertor_clone is not a member of this struct or union (/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c: 51)
>>> pgcc-Fatal-/global/scratch2/sd/hargrove/pgi-13.10/linux86-64/13.10/bin/pgc TERMINATED by signal 11
>>>
>>> Since this ends with a SEGV in the compiler, I don't think this is an issue with the C code, just a plain compiler bug.
>>> At least pgi-9.0-4 and pgi-10.9 compiled the code just fine.
>>>
>>> 3) As I noted in a separate email, there are some newly uncovered issues in the embedded hwloc w/ pgi and -m32.
>>> However, I had not tested such configurations previously, and all indications are that these issues have existed for a while.
>>> Brice is on vacation, so there will not be an official hwloc fix for this issue until next week at the earliest.
>>> [The upside is that I now have coverage for eight additional x86 configurations (true x86 or x86-64 w/ -m32).]
>>>
>>> 4) I noticed a couple of warnings somebody might want to investigate:
>>>
>>> openmpi-1.10.0rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:2323:59: warning: format specifies type 'int' but the argument has type 'struct ibv_qp *' [-Wformat]
>>> openmpi-1.10.0rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c", line 2471: warning: improper pointer/integer combination: arg #3
>>>
>>> Also worth noting:
>>>
>>> The ConnectX and ConnectIB XRC detection logic appears to be working as expected on multiple systems.
>>>
>>> I also have learned that pgi-9.0-4 is not a conforming C99 compiler when passed -m32, which is not Open MPI's fault.
>>>
>>> And as before...
>>> + I am currently without any SPARC platforms
>>> + Several qemu-emulated ARM and MIPS tests will complete by morning (though I have some ARM successes already)
>>>
>>> -Paul
>>>
>>> On Tue, Jul 21, 2015 at 12:29 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Hey folks
>>>
>>> 1.10.0rc2 is now out for review - excepting the library version numbers, this should be the final version. Please take a quick gander and let me know of any problems.
>>>
>>> http://www.open-mpi.org/software/ompi/v1.10/
>>>
>>> Ralph
>>>
>>> --
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department               Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/07/17691.php
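P.S. Coming back to the point made up-thread that registering with opal_progress should only happen once the MTL has been selected: a rough sketch of what that deferral could look like is below. The helper names and call sites here are hypothetical (the real hooks live somewhere in mtl_ofi.c / mtl_ofi_component.c); the only real APIs used are opal_progress_register() / opal_progress_unregister().

    /* Sketch only -- deferring the opal_progress hookup until the OFI MTL
     * knows it will actually be used, and tearing it down symmetrically.
     * ofi_progress_hookup/ofi_progress_teardown are hypothetical helpers;
     * ompi_mtl_ofi_progress is an assumed callback name. */
    #include <stdbool.h>
    #include "opal/runtime/opal_progress.h"

    extern int ompi_mtl_ofi_progress(void);   /* assumed callback name */

    static bool ofi_progress_hooked = false;

    /* Call from the first point where the MTL is definitely selected
     * (e.g. the module's add_procs), not from component init. */
    static void ofi_progress_hookup(void)
    {
        if (!ofi_progress_hooked) {
            opal_progress_register(ompi_mtl_ofi_progress);
            ofi_progress_hooked = true;
        }
    }

    /* Call from both the module finalize and the component close so the
     * callback can never outlive the code it points at. */
    static void ofi_progress_teardown(void)
    {
        if (ofi_progress_hooked) {
            opal_progress_unregister(ompi_mtl_ofi_progress);
            ofi_progress_hooked = false;
        }
    }

Either way, the unregister in component close is the minimum needed to stop the crash Paul is seeing.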