Hi Jeff,

Nathan and I think this is generic to all the MTLs, and that it is masked by the code in the cm select method that ups the MTL's priority. If that priority-upping code weren't there and we fell back to ob1, we'd see this behavior for all MTLs.

Howard
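For reference, the safe pattern (which Jeff and Ralph describe below) looks roughly like the sketch here. This is a minimal illustration only, with made-up function names rather than the actual code of any MTL; it just assumes the usual opal_progress_register()/opal_progress_unregister() pair:

    /* Minimal sketch (illustrative names, not actual MTL code): register the
     * progress callback only after this MTL has actually been selected, and
     * always unregister it before the component can be dlclose()d. */

    #include "opal/runtime/opal_progress.h"

    /* Hypothetical per-MTL progress function. */
    static int example_mtl_progress(void)
    {
        int completed = 0;
        /* ... poll the provider / drain completion queues here ... */
        return completed;   /* number of events progressed */
    }

    /* Call this only from the selected MTL's post-selection path
     * (e.g. module init / add_procs), NOT from component open/init. */
    static int example_mtl_start_progress(void)
    {
        return opal_progress_register(example_mtl_progress);
    }

    /* Call this from the module/component finalize path, so the function
     * pointer never outlives the component's shared object. */
    static int example_mtl_stop_progress(void)
    {
        return opal_progress_unregister(example_mtl_progress);
    }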
2015-07-24 9:12 GMT-06:00 Jeff Squyres (jsquyres) <jsquy...@cisco.com>:

> I think Ralph answered this question: if you register a progress function
> but then get your component unloaded without un-registering the progress
> function... kaboom.
>
>
> > On Jul 24, 2015, at 10:37 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
> >
> > Jeff,
> >
> > I was wrong about this. All the MTLs except for portals4 register with
> > opal_progress in their component init.
> >
> > I don't see how this is a problem, though, as base select only invokes
> > component init on the selected MTL.
> >
> > Howard
> >
> > ----------
> >
> > Sent from my smart phone, so no good typing.
> >
> > Howard
> >
> > On Jul 24, 2015 8:19 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
> > Yohann --
> >
> > Can you have a look?
> >
> >
> > > On Jul 24, 2015, at 10:15 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
> > >
> > > Looks like the ofi MTL is being naughty. It's the only MTL which
> > > registers with opal_progress in its component init method.
> > >
> > > ----------
> > >
> > > Sent from my smart phone, so no good typing.
> > >
> > > Howard
> > >
> > > On Jul 23, 2015 7:03 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
> > > It looks like one of the MTL components is registering a progress call
> > > with the opal_progress thread, and then unloading when de-selected.
> > > Registering with opal_progress should only be done once the MTL has
> > > been selected and will run.
> > >
> > >
> > >> On Jul 23, 2015, at 5:05 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> > >>
> > >> Yohann,
> > >>
> > >> With PR409 as it stands right now (commit 6daef310) I see no change
> > >> to the behavior.
> > >> I still get a SEGV below opal_progress() unless I use either
> > >>   -mca mtl ^ofi
> > >> OR
> > >>   -mca pml cm
> > >>
> > >> A backtrace from gdb appears below.
> > >>
> > >> -Paul
> > >>
> > >> (gdb) where
> > >> #0  0x00007f5bc7b59867 in ?? () from /lib64/libgcc_s.so.1
> > >> #1  0x00007f5bc7b5a119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> > >> #2  0x00007f5bcc9b08f6 in __backtrace (array=<value optimized out>, size=32)
> > >>     at ../sysdeps/ia64/backtrace.c:110
> > >> #3  0x00007f5bcc3483e1 in opal_backtrace_print (file=0x7f5bccc40880,
> > >>     prefix=0x7fff6181d1f0 "[pcp-f-5:05049] ", strip=2)
> > >>     at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/mca/backtrace/execinfo/backtrace_execinfo.c:47
> > >> #4  0x00007f5bcc3456a9 in show_stackframe (signo=11, info=0x7fff6181d770, p=0x7fff6181d640)
> > >>     at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/util/stacktrace.c:336
> > >> #5  <signal handler called>
> > >> #6  0x00007f5bc7717c58 in ?? ()
> > >> #7  0x00007f5bcc2f567a in opal_progress ()
> > >>     at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/runtime/opal_progress.c:187
> > >> #8  0x00007f5bccebbcb9 in ompi_mpi_init (argc=1, argv=0x7fff6181dd78, requested=0, provided=0x7fff6181dbf8)
> > >>     at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/ompi/runtime/ompi_mpi_init.c:645
> > >> #9  0x00007f5bccefbe77 in PMPI_Init (argc=0x7fff6181dc5c, argv=0x7fff6181dc50) at pinit.c:84
> > >> #10 0x000000000040088e in main (argc=1, argv=0x7fff6181dd78) at ring_c.c:19
> > >>
> > >> (gdb) up 6
> > >> #6  0x00007f5bc7717c58 in ?? ()
> > >> (gdb) disass
> > >> No function contains program counter for selected frame.
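For what it's worth, the anonymous frame #6 in Paul's backtrace is exactly what a stale progress callback looks like: opal_progress() walks a table of registered function pointers, and here one of them points into a component shared object that has since been unloaded, so gdb cannot map the PC to any symbol. A simplified illustration of that dispatch loop (not the actual opal_progress.c source):

    /* Simplified illustration of the progress dispatch loop. If one of the
     * registered pointers was left behind by a component that has since been
     * dlclose()d, the indirect call jumps into unmapped (or reused) text and
     * the process takes a SIGSEGV with no resolvable symbol for the frame. */

    typedef int (*progress_callback_t)(void);

    #define MAX_CALLBACKS 32
    static progress_callback_t callbacks[MAX_CALLBACKS];
    static int num_callbacks = 0;

    int example_progress_register(progress_callback_t cb)
    {
        if (num_callbacks >= MAX_CALLBACKS) return -1;
        callbacks[num_callbacks++] = cb;
        return 0;
    }

    int example_progress(void)
    {
        int events = 0;
        for (int i = 0; i < num_callbacks; ++i) {
            /* Calls through a dangling pointer crash right here, one level
             * below the progress function, as in Paul's backtrace. */
            events += callbacks[i]();
        }
        return events;
    }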
> > >>
> > >> On Thu, Jul 23, 2015 at 8:13 AM, Burette, Yohann <yohann.bure...@intel.com> wrote:
> > >> Paul,
> > >>
> > >> While looking at the issue, we noticed that we were missing some code
> > >> that deals with MTL priorities.
> > >>
> > >> PR 409 (https://github.com/open-mpi/ompi-release/pull/409) is
> > >> attempting to fix that.
> > >>
> > >> Hopefully, this will also fix the error you encountered.
> > >>
> > >> Thanks again,
> > >> Yohann
> > >>
> > >> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Hargrove
> > >> Sent: Wednesday, July 22, 2015 12:07 PM
> > >> To: Open MPI Developers
> > >> Subject: Re: [OMPI devel] 1.10.0rc2
> > >>
> > >> Yohann,
> > >>
> > >> Things run fine with those additional flags.
> > >> In fact, adding just "--mca pml cm" is sufficient to eliminate the SEGV.
> > >>
> > >> -Paul
> > >>
> > >> On Wed, Jul 22, 2015 at 8:49 AM, Burette, Yohann <yohann.bure...@intel.com> wrote:
> > >> Hi Paul,
> > >>
> > >> Thank you for doing all this testing!
> > >>
> > >> About 1), it's hard for me to see whether it's a problem with mtl:ofi
> > >> or with how OMPI selects the components to use.
> > >> Could you please run your test again with
> > >> "--mca mtl ofi --mca mtl_ofi_provider sockets --mca pml cm"?
> > >> The idea is that if it still fails, then we have a problem with either
> > >> mtl:ofi or the OFI/sockets provider. If it works, then there is an
> > >> issue with how OMPI selects what component to use.
> > >>
> > >> I just tried 1.10.0rc2 with the latest libfabric (master) and it seems
> > >> to work fine.
> > >>
> > >> Yohann
> > >>
> > >> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Hargrove
> > >> Sent: Wednesday, July 22, 2015 1:05 AM
> > >> To: Open MPI Developers
> > >> Subject: Re: [OMPI devel] 1.10.0rc2
> > >>
> > >> 1.10.0rc2 looks mostly good to me, but I still found some issues.
> > >>
> > >> 1) New to this round of testing, I have built mtl:ofi with gcc, pgi,
> > >> icc, clang, open64 and studio compilers.
> > >> I have only the sockets provider in libfabric (v1.0.0 and 1.1.0rc2).
> > >> However, unless I pass "-mca mtl ^ofi" to mpirun I get a SEGV from a
> > >> callback invoked in opal_progress().
> > >> Gdb did not give a function name for the callback, but the PC looks valid.
> > >>
> > >> 2) Of the several compilers I tried, only pgi-13.10 failed to compile mtl:ofi:
> > >>
> > >> /bin/sh ../../../../libtool --tag=CC --mode=compile pgcc
> -DHAVE_CONFIG_H -I.
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi
> -I../../../../opal/include -I../../../../orte/include
> -I../../../../ompi/include -I../../../../oshmem/include
> -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen
> -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen
> -I/usr/common/ftg/libfabric/1.1.0rc2p1/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2
> -I../../../..
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/include > -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/orte/include > -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/include > -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/oshmem/include > > -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/hwloc/hwloc191/hwloc/include > -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/hwloc/hwloc191/hwloc/include > -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent > -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent/include > -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/event/libevent2021/libevent/include > -g -c -o mtl_ofi_component.lo > /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c > > >> > > >> libtool: compile: pgcc -DHAVE_CONFIG_H -I. > -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi > -I../../../../opal/include -I../../../../orte/include > -I../../../../ompi/include -I../../../../oshmem/include > -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen > -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen > -I/usr/common/ftg/libfabric/1.1.0rc2p1/include > -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2 > -I../../../.. 
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/orte/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/oshmem/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/hwloc/hwloc191/hwloc/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/hwloc/hwloc191/hwloc/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/event/libevent2021/libevent/include
> -g -c
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c
> -fpic -DPIC -o .libs/mtl_ofi_component.o
> > >>
> > >> PGC-S-0060-opal_convertor_clone is not a member of this struct or union
> > >> (/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c: 51)
> > >>
> > >> pgcc-Fatal-/global/scratch2/sd/hargrove/pgi-13.10/linux86-64/13.10/bin/pgc TERMINATED by signal 11
> > >>
> > >> Since this ends with a SEGV in the compiler, I don't think this is an
> > >> issue with the C code, just a plain compiler bug.
> > >> At least pgi-9.0-4 and pgi-10.9 compiled the code just fine.
> > >>
> > >> 3) As I noted in a separate email, there are some newly uncovered
> > >> issues in the embedded hwloc w/ pgi and -m32.
> > >> However, I had not tested such configurations previously, and all
> > >> indications are that these issues have existed for a while.
> > >> Brice is on vacation, so there will not be an official hwloc fix for
> > >> this issue until next week at the earliest.
> > >> [The upside is that I now have coverage for eight additional x86
> > >> configurations (true x86 or x86-64 w/ -m32).]
> > >>
> > >> 4) I noticed a couple of warnings somebody might want to investigate
> > >> (a generic sketch of the usual fix for the first one is appended at
> > >> the very end of this post):
> > >>
> > >> openmpi-1.10.0rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:2323:59: warning: format specifies type 'int' but the argument has type 'struct ibv_qp *' [-Wformat]
> > >>
> > >> "openmpi-1.10.0rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c", line 2471: warning: improper pointer/integer combination: arg #3
> > >>
> > >> Also worth noting:
> > >>
> > >> The ConnectX and ConnectIB XRC detection logic appears to be working
> > >> as expected on multiple systems.
> > >>
> > >> I also have learned that pgi-9.0-4 is not a conforming C99 compiler
> > >> when passed -m32, which is not Open MPI's fault.
> > >>
> > >> And as before...
> > >>
> > >> + I am currently without any SPARC platforms
> > >>
> > >> + Several qemu-emulated ARM and MIPS tests will complete by morning
> > >> (though I have some ARM successes already)
> > >>
> > >> -Paul
> > >>
> > >> On Tue, Jul 21, 2015 at 12:29 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > >> Hey folks
> > >>
> > >> 1.10.0rc2 is now out for review - excepting the library version
> > >> numbers, this should be the final version. Please take a quick gander
> > >> and let me know of any problems.
> > >>
> > >> http://www.open-mpi.org/software/ompi/v1.10/
> > >>
> > >> Ralph
> > >>
> > >> --
> > >> Paul H. Hargrove                          phhargr...@lbl.gov
> > >> Computer Languages & Systems Software (CLaSS) Group
> > >> Computer Science Department               Tel: +1-510-495-2352
> > >> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
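One editorial footnote on item 4 of Paul's report above (the -Wformat warning at btl_openib_connect_udcm.c:2323): warnings of this class usually mean a pointer is being handed to a %d conversion. The generic fix is either to print the pointer with %p and an explicit (void *) cast, or to print the integer field that was actually intended. A generic, self-contained sketch under that assumption; the struct definition here is a stand-in so the example builds without the verbs headers, and this is not the actual udcm code:

    #include <stdio.h>

    /* Stand-in for struct ibv_qp from <infiniband/verbs.h>, only so this
     * sketch compiles on its own. */
    struct ibv_qp { int qp_num; };

    static void report_qp(struct ibv_qp *qp)
    {
        /* Wrong: passes a pointer where %d expects an int -> -Wformat
         * warning, and undefined behavior where their sizes differ. */
        /* printf("created qp %d\n", qp); */

        /* Right: print the pointer as a pointer ... */
        printf("created qp %p\n", (void *)qp);

        /* ... or print the integer field that was probably intended. */
        printf("created qp num %d\n", qp->qp_num);
    }

    int main(void)
    {
        struct ibv_qp qp = { .qp_num = 42 };
        report_qp(&qp);
        return 0;
    }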