I admit to having lost track of the discussion split among the various PRs and this email thread.
I have the following three systems to test on:
  #1) ofi is the only mtl component which can build.
  #2) Both the ofi and portals4 mtl components build.
  #3) Both the psm and mxm mtl components build.

I have applied *both* of the following to the 1.10.0rc2 tarball:
  https://github.com/hppritcha/ompi-release/commit/6daef310.patch (ompi-release PR409)
  https://github.com/hppritcha/ompi/commit/bd78ba0c.patch (ompi PR747)

I can report that on system #1, which previously would SEGV, I can now run w/o any extra args to mpirun (just "mpirun -np 2 examples/ring_c"). This is on a single workstation with no network or batch system.

On system #2 I am OK with no args; things ran fine (and would SEGV before). However, "--mca btl sm,self" (on a single host, obviously) still results in a SEGV unless I also add "--mca mtl ^ofi". There is no backtrace printed at runtime, and the core appears useless:

Core was generated by `examples/ring_c'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002b82db0ca638 in ?? () from /lib64/libgcc_s.so.1
(gdb) where
#0  0x00002b82db0ca638 in ?? () from /lib64/libgcc_s.so.1
#1  0x00002b82db0cb8bb in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2  0x00002b82db5d1fa8 in backtrace () from /lib64/libc.so.6
#3  0x00002b82dbd90b22 in opal_backtrace_print (file=0x2b82daae0480, prefix=0x0, strip=-1388398139)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-icc-14/openmpi-1.10.0rc2/opal/mca/backtrace/execinfo/backtrace_execinfo.c:47
#4  0x00002b82dbd8d484 in show_stackframe (signo=-626129792, info=0x0, p=0x2aaaad3eb9c5)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-icc-14/openmpi-1.10.0rc2/opal/util/stacktrace.c:336
#5  <signal handler called>
#6  0x00002aaaad3eb9c5 in ?? ()
#7  0x404ddf1a9fbe76c9 in ?? ()
#8  0x00002aaaad3ec56d in ?? ()
#9  0x0000001300000001 in ?? ()
#10 0x0000000000000000 in ?? ()

If I try "-np 1" then I see the following (with mtl_base_verbose=10):

$ mpirun --mca btl sm,self -mca mtl_base_verbose 10 -mca mtl ofi -np 1 examples/ring_c
[c1480:31521] mca: base: components_register: registering mtl components
[c1480:31521] mca: base: components_register: found loaded component ofi
[c1480:31521] mca: base: components_register: component ofi register function successful
[c1480:31521] mca: base: components_open: opening mtl components
[c1480:31521] mca: base: components_open: found loaded component ofi
[c1480:31521] mca: base: components_open: component ofi open function successful
[c1480:31521] mca:base:select: Auto-selecting mtl components
[c1480:31521] mca:base:select:( mtl) Querying component [ofi]
[c1480:31521] mca:base:select:( mtl) Query of component [ofi] set priority to 10
[c1480:31521] mca:base:select:( mtl) Selected component [ofi]
[c1480:31521] select: initializing mtl component ofi
[c1480:31521] select: init returned success
[c1480:31521] select: component ofi selected
[c1480:31521] mca: base: close: component ofi closed
[c1480:31521] mca: base: close: unloading component ofi
[c1480:31521] *** Process received signal ***
[c1480:31521] Signal: Segmentation fault (11)
[c1480:31521] Signal code: Address not mapped (1)
[c1480:31521] Failing at address: 0x2aaaae4c19c5
Process 0 sending 10 to 0, tag 201 (1 processes in ring)
Process 0 sent to 0
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 31521 on node c1480 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The output above may or may not mean something more to one of you, but to me it looks *consistent* with the possibility of a callback (such as a progress function) running in mtl:ofi after it has been unloaded.
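To make that suspicion concrete, here is a tiny self-contained sketch of the failure pattern I have in mind. This is NOT Open MPI code; every name in it (progress_register, ofi_like_progress, component_close, etc.) is invented for illustration, as stand-ins for what I assume are the real pieces: opal_progress_register()/opal_progress_unregister() and the MCA base unloading the component's DSO.

/* Hypothetical, simplified model of the suspected bug -- not Open MPI code. */
#include <stdio.h>
#include <stddef.h>

typedef int (*progress_cb_t)(void);

#define MAX_CB 8
static progress_cb_t callbacks[MAX_CB];
static size_t num_cb = 0;

/* Stand-in for opal_progress_register() */
static void progress_register(progress_cb_t cb)
{
    if (num_cb < MAX_CB) callbacks[num_cb++] = cb;
}

/* Stand-in for opal_progress_unregister() */
static void progress_unregister(progress_cb_t cb)
{
    for (size_t i = 0; i < num_cb; i++) {
        if (callbacks[i] == cb) {
            callbacks[i] = callbacks[--num_cb];
            return;
        }
    }
}

/* Stand-in for opal_progress(): calls every registered callback */
static void progress(void)
{
    for (size_t i = 0; i < num_cb; i++) {
        (void) callbacks[i]();
    }
}

/* "Component" code.  After a real dlclose() this function's text would be
 * unmapped, so calling through a stale pointer gives "Address not mapped". */
static int ofi_like_progress(void)
{
    puts("component progress callback ran");
    return 0;
}

static void component_init(void)
{
    progress_register(ofi_like_progress);
}

static void component_close(void)
{
    /* The bug I suspect: if this unregister is skipped (or happens only
     * after the DSO is already unloaded), the engine keeps a dangling
     * function pointer. */
    progress_unregister(ofi_like_progress);
}

int main(void)
{
    component_init();
    progress();          /* fine: the callback is registered and its code is mapped */
    component_close();   /* correct behavior: unregister BEFORE the code goes away */
    progress();          /* safe here; with the unregister skipped and the DSO
                            dlclose()d, this call would jump to unmapped memory */
    return 0;
}

In the real case, unloading the DSO unmaps the component's text, so any callback pointer left behind in the progress engine points at an unmapped address, which at least matches the "Signal code: Address not mapped (1)" shown above. I have not confirmed that mtl:ofi actually skips (or mis-orders) its unregister; it is simply the simplest explanation I can think of for a crash that occurs right after "mca: base: close: unloading component ofi".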
When I replace "-mca mtl ofi" with "-mca mtl ^ofi" or "-mca mtl portals4", the SEGV goes away.

I was also surprised to see the following on system #2, which has InfiniBand:

[c1479][[2546,1],1][/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-icc-14/openmpi-1.10.0rc2/ompi/mca/btl/openib/btl_openib_xrc.c:57:mca_btl_openib_xrc_check_api] XRC error: bad XRC API (require XRC from OFED pre 3.12).

However, I'll start a separate thread for that issue AFTER I make certain that the (M)OFED library versions on the frontend and compute nodes match.

System #3 had no problems before and still has none now (and is "in the mix" just for coverage of PR409).

-Paul

On Fri, Jul 24, 2015 at 3:28 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

> Howard,
>
> Not sure if the "--mca mtl_base_verbose 10" output is still needed, but
> I've attached it in case it is.
>
> -Paul
>
> On Fri, Jul 24, 2015 at 7:26 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
>> Paul
>>
>> Could you rerun with --mca mtl_base_verbose 10 added to cmd line and send
>> output?
>>
>> Howard
>>
>> ----------
>>
>> sent from my smart phone so no good typing.
>>
>> Howard
>> On Jul 23, 2015 6:06 PM, "Paul Hargrove" <phhargr...@lbl.gov> wrote:
>>
>>> Yohann,
>>>
>>> With PR409 as it stands right now (commit 6daef310) I see no change to
>>> the behavior.
>>> I still get a SEGV below opal_progress() unless I use either
>>>    -mca mtl ^ofi
>>> OR
>>>    -mca pml cm
>>>
>>> A backtrace from gdb appears below.
>>>
>>> -Paul
>>>
>>> (gdb) where
>>> #0  0x00007f5bc7b59867 in ?? () from /lib64/libgcc_s.so.1
>>> #1  0x00007f5bc7b5a119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
>>> #2  0x00007f5bcc9b08f6 in __backtrace (array=<value optimized out>, size=32)
>>>     at ../sysdeps/ia64/backtrace.c:110
>>> #3  0x00007f5bcc3483e1 in opal_backtrace_print (file=0x7f5bccc40880,
>>>     prefix=0x7fff6181d1f0 "[pcp-f-5:05049] ", strip=2)
>>>     at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/mca/backtrace/execinfo/backtrace_execinfo.c:47
>>> #4  0x00007f5bcc3456a9 in show_stackframe (signo=11, info=0x7fff6181d770, p=0x7fff6181d640)
>>>     at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/util/stacktrace.c:336
>>> #5  <signal handler called>
>>> #6  0x00007f5bc7717c58 in ?? ()
>>> #7  0x00007f5bcc2f567a in opal_progress ()
>>>     at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/opal/runtime/opal_progress.c:187
>>> #8  0x00007f5bccebbcb9 in ompi_mpi_init (argc=1, argv=0x7fff6181dd78, requested=0, provided=0x7fff6181dbf8)
>>>     at /scratch/phargrov/OMPI/openmpi-1.10.0rc2-linux-x86_64-sl6x/openmpi-1.10.0rc2/ompi/runtime/ompi_mpi_init.c:645
>>> #9  0x00007f5bccefbe77 in PMPI_Init (argc=0x7fff6181dc5c, argv=0x7fff6181dc50) at pinit.c:84
>>> #10 0x000000000040088e in main (argc=1, argv=0x7fff6181dd78) at ring_c.c:19
>>>
>>> (gdb) up 6
>>> #6  0x00007f5bc7717c58 in ?? ()
>>> (gdb) disass
>>> No function contains program counter for selected frame.
>>>
>>> On Thu, Jul 23, 2015 at 8:13 AM, Burette, Yohann <yohann.bure...@intel.com> wrote:
>>>
>>>> Paul,
>>>>
>>>> While looking at the issue, we noticed that we were missing some code
>>>> that deals with MTL priorities.
>>>>
>>>> PR 409 (https://github.com/open-mpi/ompi-release/pull/409) is
>>>> attempting to fix that.
>>>>
>>>> Hopefully, this will also fix the error you encountered.
>>>>
>>>> Thanks again,
>>>> Yohann
>>>>
>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Hargrove
>>>> Sent: Wednesday, July 22, 2015 12:07 PM
>>>> To: Open MPI Developers
>>>> Subject: Re: [OMPI devel] 1.10.0rc2
>>>>
>>>> Yohann,
>>>>
>>>> Things run fine with those additional flags.
>>>> In fact, adding just "--mca pml cm" is sufficient to eliminate the SEGV.
>>>>
>>>> -Paul
>>>>
>>>> On Wed, Jul 22, 2015 at 8:49 AM, Burette, Yohann <yohann.bure...@intel.com> wrote:
>>>>
>>>> Hi Paul,
>>>>
>>>> Thank you for doing all this testing!
>>>>
>>>> About 1), it’s hard for me to see whether it’s a problem with mtl:ofi
>>>> or with how OMPI selects the components to use.
>>>>
>>>> Could you please run your test again with “--mca mtl ofi --mca
>>>> mtl_ofi_provider sockets --mca pml cm”?
>>>>
>>>> The idea is that if it still fails, then we have a problem with either
>>>> mtl:ofi or the OFI/sockets provider. If it works, then there is an issue
>>>> with how OMPI selects what component to use.
>>>>
>>>> I just tried 1.10.0rc2 with the latest libfabric (master) and it seems
>>>> to work fine.
>>>>
>>>> Yohann
>>>>
>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Hargrove
>>>> Sent: Wednesday, July 22, 2015 1:05 AM
>>>> To: Open MPI Developers
>>>> Subject: Re: [OMPI devel] 1.10.0rc2
>>>>
>>>> 1.10.0rc2 looks mostly good to me, but I still found some issues.
>>>>
>>>> 1) New to this round of testing, I have built mtl:ofi with gcc, pgi,
>>>> icc, clang, open64 and studio compilers.
>>>> I have only the sockets provider in libfabric (v1.0.0 and 1.1.0rc2).
>>>> However, unless I pass "-mca mtl ^ofi" to mpirun I get a SEGV from a
>>>> callback invoked in opal_progress().
>>>> Gdb did not give a function name for the callback, but the PC looks
>>>> valid.
>>>>
>>>> 2) Of the several compilers I tried, only pgi-13.10 failed to compile
>>>> mtl:ofi:
>>>>
>>>> /bin/sh ../../../../libtool --tag=CC --mode=compile pgcc -DHAVE_CONFIG_H -I.
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi
>>>> -I../../../../opal/include -I../../../../orte/include
>>>> -I../../../../ompi/include -I../../../../oshmem/include
>>>> -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen
>>>> -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen
>>>> -I/usr/common/ftg/libfabric/1.1.0rc2p1/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2
>>>> -I../../../..
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/orte/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/oshmem/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/hwloc/hwloc191/hwloc/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/hwloc/hwloc191/hwloc/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/event/libevent2021/libevent/include
>>>> -g -c -o mtl_ofi_component.lo
>>>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c
>>>>
>>>> libtool: compile: pgcc -DHAVE_CONFIG_H -I.
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi
>>>> -I../../../../opal/include -I../../../../orte/include
>>>> -I../../../../ompi/include -I../../../../oshmem/include
>>>> -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen
>>>> -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen
>>>> -I/usr/common/ftg/libfabric/1.1.0rc2p1/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2
>>>> -I../../../..
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/orte/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/oshmem/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/hwloc/hwloc191/hwloc/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/hwloc/hwloc191/hwloc/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/opal/mca/event/libevent2021/libevent/include
>>>> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/BLD/opal/mca/event/libevent2021/libevent/include
>>>> -g -c
>>>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c
>>>> -fpic -DPIC -o .libs/mtl_ofi_component.o
>>>>
>>>> PGC-S-0060-opal_convertor_clone is not a member of this struct or union
>>>> (/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-pgi-13.10/openmpi-1.10.0rc2/ompi/mca/mtl/ofi/mtl_ofi_component.c: 51)
>>>>
>>>> pgcc-Fatal-/global/scratch2/sd/hargrove/pgi-13.10/linux86-64/13.10/bin/pgc TERMINATED by signal 11
>>>>
>>>> Since this ends with a SEGV in the compiler, I don't think this is an
>>>> issue with the C code, just a plain compiler bug.
>>>> At least pgi-9.0-4 and pgi-10.9 compiled the code just fine.
>>>>
>>>> 3) As I noted in a separate email, there are some newly uncovered
>>>> issues in the embedded hwloc w/ pgi and -m32.
>>>> However, I had not tested such configurations previously, and all
>>>> indications are that these issues have existed for a while.
>>>> Brice is on vacation, so there will not be an official hwloc fix for
>>>> this issue until next week at the earliest.
>>>> [The upside is that I now have coverage for eight additional x86
>>>> configurations (true x86 or x86-64 w/ -m32).]
>>>>
>>>> 4) I noticed a couple warnings somebody might want to investigate:
>>>>
>>>> openmpi-1.10.0rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:2323:59:
>>>> warning: format specifies type 'int' but the argument has type 'struct ibv_qp *' [-Wformat]
>>>>
>>>> "openmpi-1.10.0rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c",
>>>> line 2471: warning: improper pointer/integer combination: arg #3
>>>>
>>>> Also worth noting:
>>>>
>>>> The ConnectX and ConnectIB XRC detection logic appears to be working as
>>>> expected on multiple systems.
>>>>
>>>> I also have learned that pgi-9.0-4 is not a conforming C99 compiler
>>>> when passed -m32, which is not Open MPI's fault.
>>>>
>>>> And as before...
>>>> + I am currently without any SPARC platforms
>>>> + Several qemu-emulated ARM and MIPS tests will complete by morning
>>>>   (though I have some ARM successes already)
>>>>
>>>> -Paul
>>>>
>>>> On Tue, Jul 21, 2015 at 12:29 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Hey folks
>>>>
>>>> 1.10.0rc2 is now out for review - excepting the library version
>>>> numbers, this should be the final version. Please take a quick gander and
>>>> let me know of any problems.
>>>>
>>>> http://www.open-mpi.org/software/ompi/v1.10/
>>>>
>>>> Ralph
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2015/07/17670.php
>>>>
>>>> --
>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>> Computer Languages & Systems Software (CLaSS) Group
>>>> Computer Science Department               Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2015/07/17681.php
>>>>
>>>> --
>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>> Computer Languages & Systems Software (CLaSS) Group
>>>> Computer Science Department               Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2015/07/17687.php
>>>
>>> --
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department               Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2015/07/17688.php
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/07/17692.php
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900

--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900