A few days ago we pushed a fix to master for a strikingly similar issue. The patch will eventually make it into the 4.0 and 3.1 series, but not into 2.x. The best path forward is to migrate to a more recent OMPI version.
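If it helps, once the code is rebuilt against a newer Open MPI, a quick way to confirm which library the binary actually picks up at run time is the standard MPI_Get_library_version call. A minimal sketch (illustration only, not taken from Patrick's code; assumes mpicc is on the PATH):

/* version_check.c - print the MPI library a binary is actually running
 * against; handy after migrating between Open MPI installations.
 * Build with e.g.:  mpicc version_check.c -o version_check            */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_library_version(version, &len);
    if (rank == 0)
        printf("MPI library: %s\n", version);  /* e.g. "Open MPI v3.1.x, ..." */
    MPI_Finalize();
    return 0;
}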
George.

On Tue, Sep 18, 2018 at 3:50 AM Patrick Begou <patrick.be...@legi.grenoble-inp.fr> wrote:

> Hi
>
> I'm moving a large CFD code from GCC 4.8.5/OpenMPI 1.7.3 to GCC
> 7.3.0/OpenMPI 2.1.5, and with this latest config I get random segfaults.
> Same binary, same server, same number of processes (16), same parameters
> for the run. Sometimes it runs until the end, sometimes I get 'invalid
> memory reference'.
>
> Building the application and OpenMPI in debug mode, I saw that this random
> segfault always occurs in collective communications inside OpenMPI. I have
> no idea how to track this down. These are two call stack traces (just the
> OpenMPI part):
>
> *Calling MPI_ALLREDUCE(...)*
>
> Program received signal SIGSEGV: Segmentation fault - invalid memory
> reference.
>
> Backtrace for this error:
> #0  0x7f01937022ef in ???
> #1  0x7f0192dd0331 in mca_btl_vader_check_fboxes
>       at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
> #2  0x7f0192dd0331 in mca_btl_vader_component_progress
>       at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
> #3  0x7f0192d6b92b in opal_progress
>       at ../../opal/runtime/opal_progress.c:226
> #4  0x7f0194a8a9a4 in sync_wait_st
>       at ../../opal/threads/wait_sync.h:80
> #5  0x7f0194a8a9a4 in ompi_request_default_wait_all
>       at ../../ompi/request/req_wait.c:221
> #6  0x7f0194af1936 in ompi_coll_base_allreduce_intra_recursivedoubling
>       at ../../../../ompi/mca/coll/base/coll_base_allreduce.c:225
> #7  0x7f0194aa0a0a in PMPI_Allreduce
>       at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pallreduce.c:107
> #8  0x7f0194f2e2ba in ompi_allreduce_f
>       at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pallreduce_f.c:87
> #9  0x8e21fd in __linear_solver_deflation_m_MOD_solve_el_grp_pcg
>       at linear_solver_deflation_m.f90:341
>
>
> *Calling MPI_WAITALL()*
>
> Program received signal SIGSEGV: Segmentation fault - invalid memory
> reference.
>
> Backtrace for this error:
> #0  0x7fda5a8d72ef in ???
> #1  0x7fda59fa5331 in mca_btl_vader_check_fboxes
>       at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
> #2  0x7fda59fa5331 in mca_btl_vader_component_progress
>       at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
> #3  0x7fda59f4092b in opal_progress
>       at ../../opal/runtime/opal_progress.c:226
> #4  0x7fda5bc5f9a4 in sync_wait_st
>       at ../../opal/threads/wait_sync.h:80
> #5  0x7fda5bc5f9a4 in ompi_request_default_wait_all
>       at ../../ompi/request/req_wait.c:221
> #6  0x7fda5bca329e in PMPI_Waitall
>       at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pwaitall.c:76
> #7  0x7fda5c10bc00 in ompi_waitall_f
>       at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
> #8  0x6dcbf7 in __data_comm_m_MOD_update_ghost_ext_comm_r1
>       at data_comm_m.f90:5849
>
>
> The segfault is always located in opal/mca/btl/vader/btl_vader_fbox.h at:
>
> 207    /* call the registered callback function */
> 208    reg->cbfunc(&mca_btl_vader.super, hdr.data.tag, &desc, reg->cbdata);
>
>
> OpenMPI 2.1.5 is built with:
>
> CFLAGS="-O3 -march=native -mtune=native" CXXFLAGS="-O3 -march=native
> -mtune=native" FCFLAGS="-O3 -march=native -mtune=native" \
> ../configure --prefix=$DESTMPI --enable-mpirun-prefix-by-default
> --disable-dlopen \
> --enable-mca-no-build=openib --without-verbs --enable-mpi-cxx
> --without-slurm --enable-mpi-thread-multiple --enable-debug
> --enable-mem-debug
>
> Any help appreciated
>
> Patrick
>
> --
> ===================================================================
> | Equipe M.O.S.T.      |                                          |
> | Patrick BEGOU        | mailto:patrick.be...@grenoble-inp.fr     |
> | LEGI                 |                                          |
> | BP 53 X              | Tel 04 76 82 51 35                       |
> | 38041 GRENOBLE CEDEX | Fax 04 76 82 52 71                       |
> ===================================================================
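For reference, below is a minimal sketch of the communication pattern the two backtraces above point at: a non-blocking ghost exchange completed with MPI_Waitall, followed by an MPI_Allreduce of the kind a PCG solver issues for its dot products; on a single node both drive the same vader fast-box progress path. This is an illustrative, reproducer-style loop under assumed sizes (NITER and NGHOST are arbitrary), not Patrick's actual code:

/* halo_stress.c - illustrative sketch of the pattern in the backtraces:
 * ghost-cell exchange with MPI_Isend/MPI_Irecv + MPI_Waitall, then a
 * global MPI_Allreduce, repeated many times.                           */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NITER  10000     /* arbitrary iteration count  */
#define NGHOST 1024      /* arbitrary ghost-layer size */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    double *sendbuf = malloc(2 * NGHOST * sizeof(double));
    double *recvbuf = malloc(2 * NGHOST * sizeof(double));
    for (int i = 0; i < 2 * NGHOST; i++) sendbuf[i] = rank + i;

    for (int it = 0; it < NITER; it++) {
        MPI_Request req[4];

        /* exchange ghost layers with both neighbours */
        MPI_Irecv(recvbuf,          NGHOST, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(recvbuf + NGHOST, NGHOST, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(sendbuf,          NGHOST, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(sendbuf + NGHOST, NGHOST, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        /* global reduction, as in a PCG dot product */
        double local = recvbuf[0], global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    if (rank == 0) printf("done\n");
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Running something like "mpirun -np 16 ./halo_stress" on one node keeps all traffic on the shared-memory (vader) BTL, which is the component both backtraces implicate.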