Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

Maxime Boissonneault Fri, 15 Aug 2014 11:57:59 -0400 (EDT)

Hi Josh,
The ring_c example does not work on our login node: it prints nothing and exits with status 65.
[mboisson@helios-login1 examples]$ mpiexec -np 10 ring_c
[mboisson@helios-login1 examples]$ echo $?
65


[mboisson@helios-login1 examples]$ echo $LD_LIBRARY_PATH
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib:/usr/lib64/nvidia:/software-gpu/cuda/6.0.37/lib64:/software-gpu/cuda/6.0.37/lib:/software6/compilers/gcc/4.8/lib64:/software6/compilers/gcc/4.8/lib:/software6/apps/buildtools/20140527/lib64:/software6/apps/buildtools/20140527/lib


It does work on our compute nodes, however.

If I compile and run this with OpenMPI 1.6.5, it gives a warning, but it does work on our login node:

[mboisson@helios-login1 examples]$ mpiexec ring_c
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              helios-login1
  Registerable memory:     32768 MiB
  Total memory:            65457 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
Process 0 sending 10 to 0, tag 201 (1 processes in ring)
Process 0 sent to 0
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting



Could the condition behind this warning be causing the failure with OpenMPI 1.8.x?

I suspect it works on our compute nodes because they are configured to allow more locked pages. However, I do not understand why a simple ring test would require that much memory.



Maxime




On 2014-08-14 15:16, Joshua Ladd wrote:

Can you try to run the example code "ring_c" across nodes?

Josh

On Thu, Aug 14, 2014 at 3:14 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:


    Yes,
    Everything has been built with GCC 4.8.x, although x might have
    changed between the OpenMPI 1.8.1 build and the gromacs build. For
    OpenMPI 1.8.2rc4 however, it was the exact same compiler for
    everything.

    Maxime

    On 2014-08-14 14:57, Joshua Ladd wrote:

    Hmmm...weird. Seems like maybe a mismatch between libraries. Did
    you build OMPI with the same compiler as you did GROMACS/Charm++?

    I'm stealing this suggestion from an old Gromacs forum with
    essentially the same symptom:

    "Did you compile Open MPI and Gromacs with the same compiler
    (i.e. both gcc and the same version)? You write you tried
    different OpenMPI versions and different GCC versions but it is
    unclear whether those match. Can you provide more detail how you
    compiled (including all options you specified)? Have you tested
    any other MPI program linked against those Open MPI versions?
    Please make sure (e.g. with ldd) that the MPI and pthread library
    you compiled against is also used for execution. If you compiled
    and run on different hosts, check whether the error still occurs
    when executing on the build host."

    http://redmine.gromacs.org/issues/1025

    Josh




    On Thu, Aug 14, 2014 at 2:40 PM, Maxime Boissonneault
    <maxime.boissonnea...@calculquebec.ca> wrote:

        I just tried Gromacs with two nodes. It crashes, but with a
        different error. I get:
        [gpu-k20-13:142156] *** Process received signal ***
        [gpu-k20-13:142156] Signal: Segmentation fault (11)
        [gpu-k20-13:142156] Signal code: Address not mapped (1)
        [gpu-k20-13:142156] Failing at address: 0x8
        [gpu-k20-13:142156] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac5d070c710]
        [gpu-k20-13:142156] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac5ddfbcacf]
        [gpu-k20-13:142156] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac5ddf82a83]
        [gpu-k20-13:142156] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac5ddeb42da]
        [gpu-k20-13:142156] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac5ddea0933]
        [gpu-k20-13:142156] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac5d0930965]
        [gpu-k20-13:142156] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac5d0930a0a]
        [gpu-k20-13:142156] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac5d0930a3b]
        [gpu-k20-13:142156] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaDriverGetVersion+0x4a)[0x2ac5d094602a]
        [gpu-k20-13:142156] [ 9] /software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_print_version_info_gpu+0x55)[0x2ac5cf9a90b5]
        [gpu-k20-13:142156] [10] /software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_log_open+0x17e)[0x2ac5cf54b9be]
        [gpu-k20-13:142156] [11] mdrunmpi(cmain+0x1cdb)[0x43b4bb]
        [gpu-k20-13:142156] [12] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac5d1534d1d]
        [gpu-k20-13:142156] [13] mdrunmpi[0x407be1]
        [gpu-k20-13:142156] *** End of error message ***
        --------------------------------------------------------------------------
        mpiexec noticed that process rank 0 with PID 142156 on node
        gpu-k20-13 exited on signal 11 (Segmentation fault).
        --------------------------------------------------------------------------



        We do not have MPI_THREAD_MULTIPLE enabled in our build, so
        Charm++ cannot be using this level of threading. The
        configure line for OpenMPI was:
        ./configure --prefix=$PREFIX \
              --with-threads --with-verbs=yes --enable-shared --enable-static \
              --with-io-romio-flags="--with-file-system=nfs+lustre" \
              --without-loadleveler --without-slurm --with-tm \
              --with-cuda=$(dirname $(dirname $(which nvcc)))

        Maxime


        On 2014-08-14 14:20, Joshua Ladd wrote:

        What about between nodes? Since this is coming from the
        OpenIB BTL, would be good to check this.

        Do you know what the MPI thread level is set to when used
        with the Charm++ runtime? Is it MPI_THREAD_MULTIPLE? The
        OpenIB BTL is not thread safe.

        Josh


        On Thu, Aug 14, 2014 at 2:17 PM, Maxime Boissonneault
        <maxime.boissonnea...@calculquebec.ca> wrote:

            Hi,
            I ran gromacs successfully with OpenMPI 1.8.1 and Cuda
            6.0.37 on a single node, with 8 ranks and multiple
            OpenMP threads.

            Maxime


            On 2014-08-14 14:15, Joshua Ladd wrote:

            Hi, Maxime

            Just curious, are you able to run a vanilla MPI
            program? Can you try one of the example programs in
            the "examples" subdirectory? It looks like a threading
            issue to me.

            Thanks,

            Josh



            _______________________________________________
            users mailing list
            us...@open-mpi.org
            Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
            Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25023.php









-----------------------------------

        Maxime Boissonneault
        Computing analyst - Calcul Québec, Université Laval
        Ph.D. in physics













--
---------------------------------
Maxime Boissonneault
Computing analyst - Calcul Québec, Université Laval
Ph.D. in physics
