I can't speculate on why you did not notice the memory issue before, simply
because for months we (the developers) didn't notice it either, and our
testing infrastructure didn't catch this bug despite running millions of
tests. The root cause was a memory ordering issue, and those are really
tricky to identify.
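
To give a sense of what this kind of bug looks like, here is a minimal
sketch in plain C11 (not the actual Open MPI code; every name in it is
invented for illustration). A sender writes a payload and then raises a
"ready" flag; without release/acquire ordering, on a weakly ordered CPU the
receiver can observe the flag before the payload is visible and read
garbage:

/* Minimal sketch, NOT Open MPI code: a lock-free mailbox where a sender
 * publishes a payload and then raises a ready flag. */
#include <stdatomic.h>
#include <stdint.h>

struct mailbox {
    uint64_t   payload;   /* data written by the sender   */
    atomic_int ready;     /* "a message is waiting" flag  */
};

/* Buggy sender: with relaxed ordering the flag store may become visible
 * to the receiver before the payload store does. */
void post_buggy(struct mailbox *m, uint64_t value)
{
    m->payload = value;
    atomic_store_explicit(&m->ready, 1, memory_order_relaxed);
}

/* Fixed sender: the release store orders the payload write before the
 * flag becomes visible to the receiver. */
void post_fixed(struct mailbox *m, uint64_t value)
{
    m->payload = value;
    atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Receiver: the acquire load pairs with the release store, so once
 * ready == 1 the payload read below is guaranteed to see the sender's
 * value. Returns 1 if a message was consumed. */
int poll_mailbox(struct mailbox *m, uint64_t *out)
{
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return 0;
    *out = m->payload;
    atomic_store_explicit(&m->ready, 0, memory_order_relaxed);
    return 1;
}

The buggy version only fails when the two stores are actually reordered and
the receiver polls in that tiny window, which is why such a race can
survive millions of test runs.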

According to https://github.com/open-mpi/ompi/issues/5638, the patch was
backported to all stable release branches starting from 2.1. Until those
releases are officially out, however, you would either need to grab a
nightly snapshot or try your luck with master.

  George.


On Wed, Sep 19, 2018 at 3:41 AM Patrick Begou <
patrick.be...@legi.grenoble-inp.fr> wrote:

> Hi George
>
> thanks for your answer. I was previously using OpenMPI 3.1.2 and also had
> this problem. However, using --enable-debug --enable-mem-debug at
> configuration time, I was unable to reproduce the failure and it was quite
> difficult for me to trace the problem. Maybe I have not run enough tests
> to reach the failure point.
>
> I fell back to OpenMPI 2.1.5, thinking the problem was in the 3.x
> version. The problem was still there, but with the debug config I was
> able to trace the call stack.
>
> Which OpenMPI 3.x version do you suggest? A nightly snapshot? Cloning
> the git repo?
>
> Thanks
>
> Patrick
>
> George Bosilca wrote:
>
> A few days ago we pushed a fix to master for a strikingly similar issue.
> The patch will eventually make it into the 4.0 and 3.1 releases, but not
> into the 2.x series. The best path forward would be to migrate to a more
> recent OMPI version.
>
> George.
>
>
> On Tue, Sep 18, 2018 at 3:50 AM Patrick Begou <
> patrick.be...@legi.grenoble-inp.fr> wrote:
>
>> Hi
>>
>> I'm moving a large CFD code from GCC 4.8.5/OpenMPI 1.7.3 to GCC
>> 7.3.0/OpenMPI 2.1.5, and with this latest config I get random segfaults.
>> Same binary, same server, same number of processes (16), same parameters
>> for the run. Sometimes it runs to the end, sometimes I get an 'invalid
>> memory reference'.
>>
>> Building the application and OpenMPI in debug mode, I saw that this random
>> segfault always occurs in collective communications inside OpenMPI. I have
>> no idea how to track this down. Here are two call stack traces (just the
>> OpenMPI part):
>>
>> *Calling MPI_ALLREDUCE(...)*
>>
>> Program received signal SIGSEGV: Segmentation fault - invalid memory
>> reference.
>>
>> Backtrace for this error:
>> #0  0x7f01937022ef in ???
>> #1  0x7f0192dd0331 in mca_btl_vader_check_fboxes
>>     at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
>> #2  0x7f0192dd0331 in mca_btl_vader_component_progress
>>     at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
>> #3  0x7f0192d6b92b in opal_progress
>>     at ../../opal/runtime/opal_progress.c:226
>> #4  0x7f0194a8a9a4 in sync_wait_st
>>     at ../../opal/threads/wait_sync.h:80
>> #5  0x7f0194a8a9a4 in ompi_request_default_wait_all
>>     at ../../ompi/request/req_wait.c:221
>> #6  0x7f0194af1936 in ompi_coll_base_allreduce_intra_recursivedoubling
>>     at ../../../../ompi/mca/coll/base/coll_base_allreduce.c:225
>> #7  0x7f0194aa0a0a in PMPI_Allreduce
>>     at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pallreduce.c:107
>> #8  0x7f0194f2e2ba in ompi_allreduce_f
>>     at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pallreduce_f.c:87
>> #9  0x8e21fd in __linear_solver_deflation_m_MOD_solve_el_grp_pcg
>>     at linear_solver_deflation_m.f90:341
>>
>>
>> *Calling MPI_WAITALL()*
>>
>> Program received signal SIGSEGV: Segmentation fault - invalid memory
>> reference.
>>
>> Backtrace for this error:
>> #0  0x7fda5a8d72ef in ???
>> #1  0x7fda59fa5331 in mca_btl_vader_check_fboxes
>>     at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
>> #2  0x7fda59fa5331 in mca_btl_vader_component_progress
>>     at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
>> #3  0x7fda59f4092b in opal_progress
>>     at ../../opal/runtime/opal_progress.c:226
>> #4  0x7fda5bc5f9a4 in sync_wait_st
>>     at ../../opal/threads/wait_sync.h:80
>> #5  0x7fda5bc5f9a4 in ompi_request_default_wait_all
>>     at ../../ompi/request/req_wait.c:221
>> #6  0x7fda5bca329e in PMPI_Waitall
>>     at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pwaitall.c:76
>> #7  0x7fda5c10bc00 in ompi_waitall_f
>>     at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
>> #8  0x6dcbf7 in __data_comm_m_MOD_update_ghost_ext_comm_r1
>>     at data_comm_m.f90:5849
>>
>>
>> The segfault is always located in opal/mca/btl/vader/btl_vader_fbox.h at:
>> 207                /* call the registered callback function */
>> 208               reg->cbfunc(&mca_btl_vader.super, hdr.data.tag, &desc, reg->cbdata);
>>
>>
>> OpenMPI 2.1.5 is built with:
>> CFLAGS="-O3 -march=native -mtune=native" CXXFLAGS="-O3 -march=native
>> -mtune=native" FCFLAGS="-O3 -march=native -mtune=native" \
>> ../configure --prefix=$DESTMPI  --enable-mpirun-prefix-by-default
>> --disable-dlopen \
>> --enable-mca-no-build=openib --without-verbs --enable-mpi-cxx
>> --without-slurm --enable-mpi-thread-multiple  --enable-debug
>> --enable-mem-debug
>>
>> Any help appreciated
>>
>> Patrick
>>
>> --
>> ===================================================================
>> |  Equipe M.O.S.T.         |                                      |
>> |  Patrick BEGOU           | mailto:patrick.be...@grenoble-inp.fr |
>> |  LEGI                    |                                      |
>> |  BP 53 X                 | Tel 04 76 82 51 35                   |
>> |  38041 GRENOBLE CEDEX    | Fax 04 76 82 52 71                   |
>> ===================================================================
>>
>
> --
> ===================================================================
> |  Equipe M.O.S.T.         |                                      |
> |  Patrick BEGOU           | mailto:patrick.be...@grenoble-inp.fr |
> |  LEGI                    |                                      |
> |  BP 53 X                 | Tel 04 76 82 51 35                   |
> |  38041 GRENOBLE CEDEX    | Fax 04 76 82 52 71                   |
> ===================================================================
>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
