Hi George,

Thanks for your answer. I was previously using OpenMPI 3.1.2 and had the same problem there. However, when configuring with --enable-debug --enable-mem-debug, I was unable to reproduce the failure, so it was quite difficult for me to trace the problem. Maybe I simply did not run enough tests to reach the failure point.

I fell back to OpenMPI 2.1.5, thinking the problem was in the 3.x series. The problem was still there, but with the debug configuration I was able to capture the call stack.

Which OpenMPI 3.x version do you suggest? A nightly snapshot? Cloning the git repo?

Thanks

Patrick

George Bosilca wrote:
A few days ago we pushed a fix to master for a strikingly similar issue. The patch will eventually make it into the 4.0 and 3.1 releases, but not into the 2.x series. The best path forward will be to migrate to a more recent OMPI version.

George.


On Tue, Sep 18, 2018 at 3:50 AM Patrick Begou <patrick.be...@legi.grenoble-inp.fr> wrote:

    Hi

    I'm moving a large CFD code from GCC 4.8.5/OpenMPI 1.7.3 to GCC
    7.3.0/OpenMPI 2.1.5, and with this latest config I get random segfaults.
    Same binary, same server, same number of processes (16), same parameters
    for the run. Sometimes it runs until the end, sometimes I get 'invalid
    memory reference'.

    Building the application and OpenMPI in debug mode, I saw that this random
    segfault always occurs in collective communications inside OpenMPI. I have
    no idea how to track this down. These are two call stack traces (just the
    OpenMPI part):

    Calling MPI_ALLREDUCE(...)
    Program received signal SIGSEGV: Segmentation fault - invalid memory
    reference.

    Backtrace for this error:
    #0  0x7f01937022ef in ???
    #1  0x7f0192dd0331 in mca_btl_vader_check_fboxes
        at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
    #2  0x7f0192dd0331 in mca_btl_vader_component_progress
        at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
    #3  0x7f0192d6b92b in opal_progress
        at ../../opal/runtime/opal_progress.c:226
    #4  0x7f0194a8a9a4 in sync_wait_st
        at ../../opal/threads/wait_sync.h:80
    #5  0x7f0194a8a9a4 in ompi_request_default_wait_all
        at ../../ompi/request/req_wait.c:221
    #6  0x7f0194af1936 in ompi_coll_base_allreduce_intra_recursivedoubling
        at ../../../../ompi/mca/coll/base/coll_base_allreduce.c:225
    #7  0x7f0194aa0a0a in PMPI_Allreduce
        at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pallreduce.c:107
    #8  0x7f0194f2e2ba in ompi_allreduce_f
        at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pallreduce_f.c:87
    #9  0x8e21fd in __linear_solver_deflation_m_MOD_solve_el_grp_pcg
        at linear_solver_deflation_m.f90:341
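
    For reference, frame #9 is the PCG solver of the application
    (solve_el_grp_pcg), and the failing call is the global reduction done at
    each iteration. Below is a minimal, self-contained Fortran sketch of that
    call pattern only; the names, values and the communicator are placeholders
    and not taken from the real code:

        ! sketch_allreduce.f90 -- hypothetical sketch of the call pattern only
        program sketch_allreduce
          use mpi
          implicit none
          integer :: ierr, rank
          double precision :: local_dot, global_dot

          call MPI_INIT(ierr)
          call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

          ! placeholder for the local contribution to the dot product
          local_dot = dble(rank + 1)

          ! the collective that appears in frames #6..#9 of the trace above
          call MPI_ALLREDUCE(local_dot, global_dot, 1, MPI_DOUBLE_PRECISION, &
                             MPI_SUM, MPI_COMM_WORLD, ierr)

          call MPI_FINALIZE(ierr)
        end program sketch_allreduce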


    Calling MPI_WAITALL()

    Program received signal SIGSEGV: Segmentation fault - invalid memory
    reference.

    Backtrace for this error:
    #0  0x7fda5a8d72ef in ???
    #1  0x7fda59fa5331 in mca_btl_vader_check_fboxes
        at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
    #2  0x7fda59fa5331 in mca_btl_vader_component_progress
        at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
    #3  0x7fda59f4092b in opal_progress
        at ../../opal/runtime/opal_progress.c:226
    #4  0x7fda5bc5f9a4 in sync_wait_st
        at ../../opal/threads/wait_sync.h:80
    #5  0x7fda5bc5f9a4 in ompi_request_default_wait_all
        at ../../ompi/request/req_wait.c:221
    #6  0x7fda5bca329e in PMPI_Waitall
        at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pwaitall.c:76
    #7  0x7fda5c10bc00 in ompi_waitall_f
        at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
    #8  0x6dcbf7 in __data_comm_m_MOD_update_ghost_ext_comm_r1
        at data_comm_m.f90:5849
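
    Frame #8 here is the routine updating the ghost cells
    (update_ghost_ext_comm_r1): non-blocking sends and receives of the ghost
    layers, completed with MPI_WAITALL. A minimal Fortran sketch of that
    pattern follows; the 1-D periodic neighbour layout and the buffer size are
    assumptions made only for illustration, not the real exchange:

        ! sketch_ghost_exchange.f90 -- hypothetical sketch of the call pattern only
        program sketch_ghost_exchange
          use mpi
          implicit none
          integer, parameter :: n = 1000
          integer :: ierr, rank, nprocs, left, right
          integer :: requests(4)
          double precision :: send_left(n), send_right(n), recv_left(n), recv_right(n)

          call MPI_INIT(ierr)
          call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
          call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

          ! assumed 1-D periodic neighbour layout
          left  = mod(rank - 1 + nprocs, nprocs)
          right = mod(rank + 1, nprocs)

          send_left  = dble(rank)
          send_right = dble(rank)

          ! post the non-blocking receives and sends for the ghost layers
          call MPI_IRECV(recv_left,  n, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, requests(1), ierr)
          call MPI_IRECV(recv_right, n, MPI_DOUBLE_PRECISION, right, 1, MPI_COMM_WORLD, requests(2), ierr)
          call MPI_ISEND(send_right, n, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, requests(3), ierr)
          call MPI_ISEND(send_left,  n, MPI_DOUBLE_PRECISION, left,  1, MPI_COMM_WORLD, requests(4), ierr)

          ! completion point corresponding to frames #6..#8 of the trace above
          call MPI_WAITALL(4, requests, MPI_STATUSES_IGNORE, ierr)

          call MPI_FINALIZE(ierr)
        end program sketch_ghost_exchange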


    The segfault is always located in opal/mca/btl/vader/btl_vader_fbox.h at
    207                /* call the registered callback function */
    208 reg->cbfunc(&mca_btl_vader.super, hdr.data.tag, &desc, reg->cbdata);


    OpenMPI 2.1.5 is built with:
    CFLAGS="-O3 -march=native -mtune=native" CXXFLAGS="-O3 -march=native
    -mtune=native" FCFLAGS="-O3 -march=native -mtune=native" \
    ../configure --prefix=$DESTMPI --enable-mpirun-prefix-by-default
    --disable-dlopen \
    --enable-mca-no-build=openib --without-verbs --enable-mpi-cxx
    --without-slurm --enable-mpi-thread-multiple  --enable-debug
    --enable-mem-debug

    Any help appreciated

    Patrick

    --
    ===================================================================
    |  Equipe M.O.S.T.         |                                      |
    |  Patrick BEGOU           |mailto:patrick.be...@grenoble-inp.fr  |
    |  LEGI                    |                                      |
    |  BP 53 X                 | Tel 04 76 82 51 35                   |
    |  38041 GRENOBLE CEDEX    | Fax 04 76 82 52 71                   |
    ===================================================================


--
===================================================================
|  Equipe M.O.S.T.         |                                      |
|  Patrick BEGOU           | mailto:patrick.be...@grenoble-inp.fr |
|  LEGI                    |                                      |
|  BP 53 X                 | Tel 04 76 82 51 35                   |
|  38041 GRENOBLE CEDEX    | Fax 04 76 82 52 71                   |
===================================================================
