Do you have "--disable-dlopen" in your configure options? That might force coll_ml to be loaded first even with "-mca coll ^ml".
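[Editor's note: a quick way to check how the coll components were built. This is a sketch, assuming `ompi_info` from the same Open MPI install is on PATH; with `--disable-dlopen`, components are linked into libmpi rather than opened as standalone DSOs at runtime, which can change which copy of a conflicting symbol wins.]

```shell
if command -v ompi_info >/dev/null 2>&1; then
  # One line per built coll component; if both "ml" and "hcoll" appear here,
  # "-mca coll ^ml" should deselect OMPI's ml component at runtime.
  ompi_info | grep "MCA coll" || true
else
  echo "ompi_info not found"
fi
```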
The next HPCX is expected to release by the end of Aug.

-Devendar

On Wed, Aug 12, 2015 at 3:30 PM, David Shrader <dshra...@lanl.gov> wrote:

> I remember seeing those, but forgot about them. I am curious, though, why
> using '-mca coll ^ml' wouldn't work for me.
>
> We'll watch for the next HPCX release. Is there an ETA on when that
> release may happen? Thank you for the help!
> David
>
> On 08/12/2015 04:04 PM, Deva wrote:
>
> David,
>
> This is because of an hcoll symbol conflict with the ml coll module inside OMPI.
> HCOLL is derived from the ml module. This issue is fixed in the hcoll library and
> will be available in the next HPCX release.
>
> Some earlier discussion of this issue:
> http://www.open-mpi.org/community/lists/users/2015/06/27154.php
> http://www.open-mpi.org/community/lists/devel/2015/06/17562.php
>
> -Devendar
>
> On Wed, Aug 12, 2015 at 2:52 PM, David Shrader <dshra...@lanl.gov> wrote:
>
>> Interesting... the seg faults went away:
>>
>> [dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
>> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>> [1439416182.732720] [zo-fe1:14690:0]  shm.c:65  MXM WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>> [1439416182.733640] [zo-fe1:14689:0]  shm.c:65  MXM WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>> 0: Running on host zo-fe1.lanl.gov
>> 0: We have 2 processors
>> 0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty
>>
>> This implies to me that some other library is being used instead of
>> /usr/lib64/libhcoll.so, but I am not sure how that could be...
>>
>> Thanks,
>> David
>>
>> On 08/12/2015 03:30 PM, Deva wrote:
>>
>> Hi David,
>>
>> I tried the same tarball on OFED-1.5.4.1 and I could not reproduce the
>> issue. Can you do one more quick test with setting LD_PRELOAD to the hcoll lib?
>>
>> $ LD_PRELOAD=<path/to/hcoll/lib/libhcoll.so> mpirun -n 2 -mca coll ^ml ./a.out
>>
>> -Devendar
>>
>> On Wed, Aug 12, 2015 at 12:52 PM, David Shrader <dshra...@lanl.gov> wrote:
>>
>>> The admin that rolled the hcoll rpm that we're using (and got it in
>>> system space) said that she got it from
>>> hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.
>>>
>>> Thanks,
>>> David
>>>
>>> On 08/12/2015 10:51 AM, Deva wrote:
>>>
>>> From where did you grab this HCOLL lib? MOFED or HPCX? What version?
>>>
>>> On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:
>>>
>>>> Hey Devendar,
>>>>
>>>> It looks like I still get the error:
>>>>
>>>> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
>>>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>>>> [1439397957.351764] [zo-fe1:14678:0]  shm.c:65  MXM WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>> [1439397957.352704] [zo-fe1:14677:0]  shm.c:65  MXM WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>> [zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
>>>> [zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
>>>> ==== backtrace ====
>>>>  2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>>>  3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>>>  4 0x00000000000326a0 killpg()  ??:0
>>>>  5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
>>>>  6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>>>>  7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>>>  8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
>>>>  9 0x000000000006ace9 hcoll_create_context()  ??:0
>>>> 10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
>>>> 11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
>>>> 12 0x0000000000073fc4 ompi_mpi_init()  ??:0
>>>> 13 0x0000000000092ea0 PMPI_Init()  ??:0
>>>> 14 0x00000000004009b6 main()  ??:0
>>>> 15 0x000000000001ed5d __libc_start_main()  ??:0
>>>> 16 0x00000000004008c9 _start()  ??:0
>>>> ===================
>>>> ==== backtrace ====
>>>> [identical backtrace from the second rank]
>>>> ===================
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited
>>>> on signal 11 (Segmentation fault).
>>>> --------------------------------------------------------------------------
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>> On 08/12/2015 10:42 AM, Deva wrote:
>>>>
>>>> Hi David,
>>>>
>>>> This issue is from the hcoll library. It could be because of a symbol
>>>> conflict with the ml module. This was fixed recently in HCOLL. Can you try
>>>> with "-mca coll ^ml" and see if this workaround works in your setup?
>>>>
>>>> -Devendar
>>>>
>>>> On Wed, Aug 12, 2015 at 9:30 AM, David Shrader <dshra...@lanl.gov> wrote:
>>>>
>>>>> Hello Gilles,
>>>>>
>>>>> Thank you very much for the patch! It is much more complete than mine.
>>>>> Using that patch and re-running autogen.pl, I am able to build 1.8.8
>>>>> with './configure --with-hcoll' without errors.
>>>>>
>>>>> I do have issues when it comes to running 1.8.8 with hcoll built in,
>>>>> however.
>>>>> In my quick sanity test of running a basic parallel hello world C
>>>>> program, I get the following:
>>>>>
>>>>> [dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
>>>>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>>>>> [1439390789.039197] [zo-fe1:31354:0]  shm.c:65  MXM WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>> [1439390789.040265] [zo-fe1:31353:0]  shm.c:65  MXM WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>> [zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
>>>>> [zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
>>>>> ==== backtrace ====
>>>>>  2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>>>>  3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>>>>  4 0x00000000000326a0 killpg()  ??:0
>>>>>  5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
>>>>>  6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>>>>>  7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>>>>  8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
>>>>>  9 0x000000000006ace9 hcoll_create_context()  ??:0
>>>>> 10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
>>>>> 11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
>>>>> 12 0x0000000000074ee4 ompi_mpi_init()  ??:0
>>>>> 13 0x0000000000093dc0 PMPI_Init()  ??:0
>>>>> 14 0x00000000004009b6 main()  ??:0
>>>>> 15 0x000000000001ed5d __libc_start_main()  ??:0
>>>>> 16 0x00000000004008c9 _start()  ??:0
>>>>> ===================
>>>>> ==== backtrace ====
>>>>> [identical backtrace from the second rank]
>>>>> ===================
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 0 with PID 31353 on node zo-fe1
>>>>> exited on signal 11 (Segmentation fault).
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> I do not get this message with only 1 process.
>>>>>
>>>>> I am using hcoll 3.2.748. Could this be an issue with hcoll itself or
>>>>> something with my ompi build?
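[Editor's note: one way to probe the suspected clash is to compare the exported dynamic symbols of libhcoll.so against Open MPI's coll_ml DSO; any name defined in both is a candidate for the wrong-copy resolution seen in the backtraces above. A sketch, assuming binutils `nm`; both paths are hypothetical and will differ per install.]

```shell
# Hypothetical paths -- adjust for your hcoll rpm and Open MPI prefix.
HCOLL_LIB=/usr/lib64/libhcoll.so
ML_DSO=/usr/lib64/openmpi/mca_coll_ml.so

if [ -r "$HCOLL_LIB" ] && [ -r "$ML_DSO" ]; then
  tmp=$(mktemp -d)
  # Defined dynamic symbols of each library, then their intersection:
  nm -D --defined-only "$HCOLL_LIB" | awk '{print $NF}' | sort -u > "$tmp/hcoll.syms"
  nm -D --defined-only "$ML_DSO"    | awk '{print $NF}' | sort -u > "$tmp/ml.syms"
  comm -12 "$tmp/hcoll.syms" "$tmp/ml.syms"   # names exported by BOTH libraries
else
  echo "libraries not found; skipping symbol comparison"
fi
```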
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>> On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
>>>>>
>>>>> Thanks David,
>>>>>
>>>>> I made a PR for the v1.8 branch at
>>>>> https://github.com/open-mpi/ompi-release/pull/492
>>>>>
>>>>> the patch is attached (it required some back-porting)
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 8/12/2015 4:01 AM, David Shrader wrote:
>>>>>
>>>>> I have cloned Gilles' topic/hcoll_config branch and, after running
>>>>> autogen.pl, have found that './configure --with-hcoll' does indeed
>>>>> work now. I used Gilles' branch as I wasn't sure how best to get the pull
>>>>> request changes into my own clone of master. It looks like the proper
>>>>> checks are happening, too:
>>>>>
>>>>> --- MCA component coll:hcoll (m4 configuration macro)
>>>>> checking for MCA component coll:hcoll compile mode... dso
>>>>> checking --with-hcoll value... simple ok (unspecified)
>>>>> checking hcoll/api/hcoll_api.h usability... yes
>>>>> checking hcoll/api/hcoll_api.h presence... yes
>>>>> checking for hcoll/api/hcoll_api.h... yes
>>>>> looking for library without search path
>>>>> checking for library containing hcoll_get_version... -lhcoll
>>>>> checking if MCA component coll:hcoll can compile... yes
>>>>>
>>>>> I haven't checked whether or not Open MPI builds successfully as I
>>>>> don't have much experience running off of the latest source. For now, I
>>>>> think I will try to generate a patch to the 1.8.8 configure script and see
>>>>> if that works as expected.
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>> On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:
>>>>>
>>>>> On Aug 11, 2015, at 1:39 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:
>>>>>
>>>>> Please fix the hcoll test (and code) to be correct.
>>>>>
>>>>> Any configure test that adds /usr/lib and/or /usr/include to any compile
>>>>> flags is broken.
>>>>>
>>>>> +1
>>>>>
>>>>> Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some
>>>>> comments to it.
>>>>>
>>>>> --
>>>>> David Shrader
>>>>> HPC-3 High Performance Computer Systems
>>>>> Los Alamos National Lab
>>>>> Email: dshrader <at> lanl.gov
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/users/2015/08/27432.php

--

-Devendar