David,

I guess you do not want to use the ml coll module at all in Open MPI 1.8.8.

You can simply do:

touch ompi/mca/coll/ml/.ompi_ignore
./autogen.pl
./configure ...
make && make install

so that the ml component is not even built.
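
After the install, you can double check that ml is gone with something like this (the grep pattern is just an illustration):

ompi_info | grep " coll:"

The ml component should no longer show up in the list of coll components.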

Cheers,

Gilles

On 8/13/2015 7:30 AM, David Shrader wrote:
I remember seeing those, but forgot about them. I am curious, though, why using '-mca coll ^ml' wouldn't work for me.

We'll watch for the next HPCX release. Is there an ETA on when that release may happen? Thank you for the help!
David

On 08/12/2015 04:04 PM, Deva wrote:
David,

This is because hcoll's symbols conflict with the ml coll module inside OMPI; HCOLL is derived from the ml module. The issue is fixed in the hcoll library and the fix will be available in the next HPCX release.
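
(For the curious, a generic way to spot overlapping exported names is to intersect the dynamic symbol tables of libhcoll and OMPI's ml-related component libraries; the paths and temporary file names below are only examples and depend on your install:

nm -D /usr/lib64/libhcoll.so | awk '{print $NF}' | sort > hcoll.syms
nm -D <ompi-prefix>/lib/openmpi/mca_coll_ml.so | awk '{print $NF}' | sort > ompi_ml.syms
comm -12 hcoll.syms ompi_ml.syms
)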

Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php

-Devendar

On Wed, Aug 12, 2015 at 2:52 PM, David Shrader <dshra...@lanl.gov> wrote:

    Interesting... the seg faults went away:

    [dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
    [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
    App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
    [1439416182.732720] [zo-fe1:14690:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
    [1439416182.733640] [zo-fe1:14689:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
    0: Running on host zo-fe1.lanl.gov
    0: We have 2 processors
    0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty

    This implies to me that some other library is being used instead
    of /usr/lib64/libhcoll.so, but I am not sure how that could be...
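
    (One quick way to see which libhcoll the coll/hcoll component actually resolves to, without the preload, is to run ldd on the component's shared object; the install prefix below is just a placeholder:

    ldd <ompi-install-prefix>/lib/openmpi/mca_coll_hcoll.so | grep hcoll
    )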

    Thanks,
    David

    On 08/12/2015 03:30 PM, Deva wrote:
    Hi David,

    I tried the same tarball on OFED-1.5.4.1 and I could not reproduce
    the issue. Can you do one more quick test, setting LD_PRELOAD to the
    hcoll lib?

    $ LD_PRELOAD=<path/to/hcoll/lib/libhcoll.so> mpirun -n 2 -mca coll ^ml ./a.out

    -Devendar

    On Wed, Aug 12, 2015 at 12:52 PM, David Shrader
    <dshra...@lanl.gov> wrote:

        The admin that rolled the hcoll rpm that we're using (and
        got it in system space) said that she got it from
        hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.

        Thanks,
        David


        On 08/12/2015 10:51 AM, Deva wrote:
        Where did you grab this HCOLL lib from: MOFED or HPCX? What version?

        On Wed, Aug 12, 2015 at 9:47 AM, David Shrader
        <dshra...@lanl.gov> wrote:

            Hey Devendar,

            It looks like I still get the error:

            [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
            App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
            [1439397957.351764] [zo-fe1:14678:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
            [1439397957.352704] [zo-fe1:14677:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
            [zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
            [zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
            ==== backtrace ====
            2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
            3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
            4 0x00000000000326a0 killpg()  ??:0
            5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
            6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
            7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
            8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
            9 0x000000000006ace9 hcoll_create_context()  ??:0
            10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
            11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
            12 0x0000000000073fc4 ompi_mpi_init()  ??:0
            13 0x0000000000092ea0 PMPI_Init()  ??:0
            14 0x00000000004009b6 main()  ??:0
            15 0x000000000001ed5d __libc_start_main()  ??:0
            16 0x00000000004008c9 _start()  ??:0
            ===================
            ==== backtrace ====
            2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
            3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
            4 0x00000000000326a0 killpg()  ??:0
            5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
            6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
            7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
            8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
            9 0x000000000006ace9 hcoll_create_context()  ??:0
            10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
            11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
            12 0x0000000000073fc4 ompi_mpi_init()  ??:0
            13 0x0000000000092ea0 PMPI_Init()  ??:0
            14 0x00000000004009b6 main()  ??:0
            15 0x000000000001ed5d __libc_start_main()  ??:0
            16 0x00000000004008c9 _start()  ??:0
            ===================
            --------------------------------------------------------------------------
            mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited on signal 11 (Segmentation fault).
            --------------------------------------------------------------------------

            Thanks,
            David

            On 08/12/2015 10:42 AM, Deva wrote:
            Hi David,

            This issue is from the hcoll library. It could be because of
            a symbol conflict with the ml module, which was fixed recently
            in HCOLL. Can you try with "-mca coll ^ml" and see if this
            workaround works in your setup?

            -Devendar

            On Wed, Aug 12, 2015 at 9:30 AM, David Shrader
            <dshra...@lanl.gov> wrote:

                Hello Gilles,

                Thank you very much for the patch! It is much more
                complete than mine. Using that patch and
                re-running autogen.pl, I am
                able to build 1.8.8 with './configure
                --with-hcoll' without errors.

                I do have issues when it comes to running 1.8.8
                with hcoll built in, however. In my quick sanity
                test of running a basic parallel hello world C
                program, I get the following:

                [dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
                App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
                [1439390789.039197] [zo-fe1:31354:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
                [1439390789.040265] [zo-fe1:31353:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
                [zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
                [zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
                ==== backtrace ====
                2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
                3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
                4 0x00000000000326a0 killpg()  ??:0
                5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
                6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
                7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
                8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
                9 0x000000000006ace9 hcoll_create_context()  ??:0
                10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
                11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
                12 0x0000000000074ee4 ompi_mpi_init()  ??:0
                13 0x0000000000093dc0 PMPI_Init()  ??:0
                14 0x00000000004009b6 main()  ??:0
                15 0x000000000001ed5d __libc_start_main()  ??:0
                16 0x00000000004008c9 _start()  ??:0
                ===================
                ==== backtrace ====
                2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
                3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
                4 0x00000000000326a0 killpg()  ??:0
                5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
                6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
                7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
                8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
                9 0x000000000006ace9 hcoll_create_context()  ??:0
                10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
                11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
                12 0x0000000000074ee4 ompi_mpi_init()  ??:0
                13 0x0000000000093dc0 PMPI_Init()  ??:0
                14 0x00000000004009b6 main()  ??:0
                15 0x000000000001ed5d __libc_start_main()  ??:0
                16 0x00000000004008c9 _start()  ??:0
                ===================
                --------------------------------------------------------------------------
                mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited on signal 11 (Segmentation fault).
                --------------------------------------------------------------------------

                I do not get this message with only 1 process.

                I am using hcoll 3.2.748. Could this be an issue
                with hcoll itself or something with my ompi build?
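
                (As a quick sanity check on the build side, ompi_info should list the hcoll component if it was built in, e.g.:

                ompi_info | grep hcoll
                )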

                Thanks,
                David

                On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
                Thanks David,

                I made a PR for the v1.8 branch at
                https://github.com/open-mpi/ompi-release/pull/492

                The patch is attached (it required some back-porting).

                Cheers,

                Gilles

                On 8/12/2015 4:01 AM, David Shrader wrote:
                I have cloned Gilles' topic/hcoll_config branch
                and, after running autogen.pl, have found that
                './configure --with-hcoll' does indeed work now.
                I used Gilles' branch as I wasn't sure how best
                to get the pull request changes into my own
                clone of master. It looks like the proper checks
                are happening, too:

                --- MCA component coll:hcoll (m4 configuration macro)
                checking for MCA component coll:hcoll compile mode... dso
                checking --with-hcoll value... simple ok (unspecified)
                checking hcoll/api/hcoll_api.h usability... yes
                checking hcoll/api/hcoll_api.h presence... yes
                checking for hcoll/api/hcoll_api.h... yes
                looking for library without search path
                checking for library containing hcoll_get_version... -lhcoll
                checking if MCA component coll:hcoll can compile... yes

                I haven't checked whether or not Open MPI builds
                successfully as I don't have much experience
                running off of the latest source. For now, I
                think I will try to generate a patch to the
                1.8.8 configure script and see if that works as
                expected.

                Thanks,
                David

                On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:
                On Aug 11, 2015, at 1:39 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:
                Please fix the hcoll test (and code) to be correct.

                Any configure test that adds /usr/lib and/or /usr/include to any compile flags is broken.
                +1

                Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some comments to it.



--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov


_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27448.php
