Interestingly enough, I have found that using --disable-dlopen causes the seg fault whether or not --enable-mca-no-build=coll-ml is used. That is, the following configure line generates a build of Open MPI that will *not* seg fault when running a simple hello world program:

./configure --prefix=/tmp/dshrader-ompi-1.8.8-install --enable-mca-no-build=coll-ml --with-mxm=no --with-hcoll

The following configure line, on the other hand, produces a build of Open MPI that *will* seg fault with the same error I mentioned before:

./configure --prefix=/tmp/dshrader-ompi-1.8.8-install --enable-mca-no-build=coll-ml --with-mxm=no --with-hcoll --disable-dlopen

I'm not sure why this would be.
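
One visible difference between the two builds, for what it's worth: with --disable-dlopen the components get linked into libmpi rather than installed as individual plugin DSOs, so something like

    ls /tmp/dshrader-ompi-1.8.8-install/lib/openmpi/

should show mca_*.so plugins only for the first (dlopen-enabled) build.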

Thanks,
David

On 08/13/2015 11:19 AM, Jeff Squyres (jsquyres) wrote:
Ah, if you're using --disable-dlopen, then you won't find individual plugin DSOs.

Instead, you can configure this way:

     ./configure --enable-mca-no-build=coll-ml ...

This will disable the build of the coll/ml component altogether.


On Aug 13, 2015, at 11:23 AM, David Shrader <dshra...@lanl.gov> wrote:

Hey Jeff,

I'm actually not able to find coll_ml related files at that location. All I see 
are the following files:

[dshrader@zo-fe1 openmpi]$ ls /usr/projects/hpcsoft/toss2/zorrillo/openmpi/1.8.8-gcc-4.4/lib/openmpi/
libompi_dbg_msgq.a  libompi_dbg_msgq.la  libompi_dbg_msgq.so

In this particular build, I am using platform files instead of the stripped-down debug builds I was doing before. Could something in the platform files cause the coll_ml related files to be moved or combined with something else?
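
One way to check whether they simply ended up somewhere else under that install prefix would be something like:

    find /usr/projects/hpcsoft/toss2/zorrillo/openmpi/1.8.8-gcc-4.4 -name "*coll_ml*"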

Thanks,
David

On 08/13/2015 04:02 AM, Jeff Squyres (jsquyres) wrote:
Note that this will require you to have fairly recent GNU Autotools installed.

Another workaround for avoiding the coll ml module would be to install Open MPI 
as normal, and then rm the following files after installation:

    rm $prefix/lib/openmpi/mca_coll_ml*

This will physically remove the coll ml plugin from the Open MPI installation, 
and therefore it won't/can't be used (or interfere with the hcoll plugin).
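
To verify, ompi_info from that same installation should no longer list an ml entry under coll, e.g.:

    $prefix/bin/ompi_info | grep " coll:"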


On Aug 13, 2015, at 2:03 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

David,

I guess you do not want to use the ml coll module at all in Open MPI 1.8.8.

You can simply do

    touch ompi/mca/coll/ml/.ompi_ignore
    ./autogen.pl
    ./configure ...
    make && make install

so that the ml component is not even built.

Cheers,

Gilles

On 8/13/2015 7:30 AM, David Shrader wrote:
I remember seeing those, but forgot about them. I am curious, though, why using 
'-mca coll ^ml' wouldn't work for me.

We'll watch for the next HPCX release. Is there an ETA on when that release may 
happen? Thank you for the help!
David

On 08/12/2015 04:04 PM, Deva wrote:
David,

This is because hcoll symbols conflict with the ml coll module inside OMPI. HCOLL is derived from the ml module. This issue is fixed in the hcoll library, and the fix will be available in the next HPCX release.

Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php

-Devendar

On Wed, Aug 12, 2015 at 2:52 PM, David Shrader <dshra...@lanl.gov> wrote:
Interesting... the seg faults went away:

[dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439416182.732720] [zo-fe1:14690:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439416182.733640] [zo-fe1:14689:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
0: Running on host zo-fe1.lanl.gov
0: We have 2 processors
0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty

This implies to me that some other library is being used instead of 
/usr/lib64/libhcoll.so, but I am not sure how that could be...
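
One way to check which libhcoll.so the hcoll component actually resolves to (assuming it was built as a DSO under the install prefix) would be something like:

    ldd <prefix>/lib/openmpi/mca_coll_hcoll.so | grep hcoll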

Thanks,
David

On 08/12/2015 03:30 PM, Deva wrote:
Hi David,

I tried the same tarball on OFED-1.5.4.1 and I could not reproduce the issue. Can you do one more quick test, setting LD_PRELOAD to the hcoll lib?

$ LD_PRELOAD=<path/to/hcoll/lib/libhcoll.so> mpirun -n 2 -mca coll ^ml ./a.out

-Devendar

On Wed, Aug 12, 2015 at 12:52 PM, David Shrader <dshra...@lanl.gov> wrote:
The admin who rolled the hcoll rpm that we're using (and installed it in system space) said that she got it from hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.

Thanks,
David


On 08/12/2015 10:51 AM, Deva wrote:
From where did you grab this HCOLL lib? MOFED or HPCX? What version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:
Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439397957.351764] [zo-fe1:14678:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
4 0x00000000000326a0 killpg()  ??:0
5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x000000000006ace9 hcoll_create_context()  ??:0
10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
12 0x0000000000073fc4 ompi_mpi_init()  ??:0
13 0x0000000000092ea0 PMPI_Init()  ??:0
14 0x00000000004009b6 main()  ??:0
15 0x000000000001ed5d __libc_start_main()  ??:0
16 0x00000000004008c9 _start()  ??:0
===================
==== backtrace ====
2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
4 0x00000000000326a0 killpg()  ??:0
5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x000000000006ace9 hcoll_create_context()  ??:0
10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
12 0x0000000000073fc4 ompi_mpi_init()  ??:0
13 0x0000000000092ea0 PMPI_Init()  ??:0
14 0x00000000004009b6 main()  ??:0
15 0x000000000001ed5d __libc_start_main()  ??:0
16 0x00000000004008c9 _start()  ??:0
===================
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Thanks,
David

On 08/12/2015 10:42 AM, Deva wrote:
Hi David,

This issue is from the hcoll library. It could be because of a symbol conflict with the ml module. This was fixed recently in HCOLL. Can you try with "-mca coll ^ml" and see if this workaround works in your setup?
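
For example, assuming a two-rank run of your test program:

    mpirun -n 2 -mca coll ^ml ./a.out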

-Devendar

On Wed, Aug 12, 2015 at 9:30 AM, David Shrader <dshra...@lanl.gov> wrote:
Hello Gilles,

Thank you very much for the patch! It is much more complete than mine. Using 
that patch and re-running autogen.pl, I am able to build 1.8.8 with 
'./configure --with-hcoll' without errors.

I do have issues when it comes to running 1.8.8 with hcoll built in, however. 
In my quick sanity test of running a basic parallel hello world C program, I 
get the following:

[dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439390789.039197] [zo-fe1:31354:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439390789.040265] [zo-fe1:31353:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
[zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
4 0x00000000000326a0 killpg()  ??:0
5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x000000000006ace9 hcoll_create_context()  ??:0
10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
12 0x0000000000074ee4 ompi_mpi_init()  ??:0
13 0x0000000000093dc0 PMPI_Init()  ??:0
14 0x00000000004009b6 main()  ??:0
15 0x000000000001ed5d __libc_start_main()  ??:0
16 0x00000000004008c9 _start()  ??:0
===================
==== backtrace ====
2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
4 0x00000000000326a0 killpg()  ??:0
5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x000000000006ace9 hcoll_create_context()  ??:0
10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
12 0x0000000000074ee4 ompi_mpi_init()  ??:0
13 0x0000000000093dc0 PMPI_Init()  ??:0
14 0x00000000004009b6 main()  ??:0
15 0x000000000001ed5d __libc_start_main()  ??:0
16 0x00000000004008c9 _start()  ??:0
===================
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I do not get this message with only 1 process.

I am using hcoll 3.2.748. Could this be an issue with hcoll itself or something 
with my ompi build?
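
(If it helps, the installed version can be double-checked with something like

    rpm -q hcoll

assuming the package is simply named hcoll.)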

Thanks,
David

On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
Thanks David,

I made a PR for the v1.8 branch at https://github.com/open-mpi/ompi-release/pull/492

The patch is attached (it required some back-porting).

Cheers,

Gilles

On 8/12/2015 4:01 AM, David Shrader wrote:
I have cloned Gilles' topic/hcoll_config branch and, after running autogen.pl, 
have found that './configure --with-hcoll' does indeed work now. I used Gilles' 
branch as I wasn't sure how best to get the pull request changes into my own 
clone of master. It looks like the proper checks are happening, too:

--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... dso
checking --with-hcoll value... simple ok (unspecified)
checking hcoll/api/hcoll_api.h usability... yes
checking hcoll/api/hcoll_api.h presence... yes
checking for hcoll/api/hcoll_api.h... yes
looking for library without search path
checking for library containing hcoll_get_version... -lhcoll
checking if MCA component coll:hcoll can compile... yes

I haven't checked whether or not Open MPI builds successfully as I don't have 
much experience running off of the latest source. For now, I think I will try 
to generate a patch to the 1.8.8 configure script and see if that works as 
expected.

Thanks,
David

On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:
On Aug 11, 2015, at 1:39 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:

Please fix the hcoll test (and code) to be correct.

Any configure test that adds /usr/lib and/or /usr/include to any compile flags 
is broken.

+1

Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some comments to it.


--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov

