Some updates for this OpenMPI bug:
1) It appears in OpenMPI 2.1.x when configured with --enable-heterogeneous,
which is not a default option and is not commonly used, but Ubuntu somehow
built its packages with it.
2) OpenMPI fixed it in 3.x.
3) It was reported to Ubuntu two years ago but is still unassigned.
Awesome, many thanks for your efforts!
On 7/31/19 9:17 PM, Zhang, Junchao wrote:
> Hi, Fabian,
> I found it is an OpenMPI bug w.r.t. self-to-self MPI_Send/Recv using
> MPI_ANY_SOURCE for message matching. OpenMPI does not put the correct value
> in the recv buffer.
> I have a workaround
>
Hi, Fabian,
I found it is an OpenMPI bug w.r.t. self-to-self MPI_Send/Recv using
MPI_ANY_SOURCE for message matching: OpenMPI does not put the correct value in
the recv buffer.
I have a workaround
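For illustration, the failing pattern can be reduced to a stand-alone
reproducer along the following lines (a sketch only, not the actual test from
this thread): a rank posts a receive with MPI_ANY_SOURCE and then sends a
small buffer to itself; on a correct MPI the received data must match what was
sent.

/* Self-to-self send/receive with MPI_ANY_SOURCE matching (sketch). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int         rank, i, nerr = 0;
  int         sendbuf[4], recvbuf[4];
  MPI_Request req;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  for (i = 0; i < 4; i++) { sendbuf[i] = 100*rank + i; recvbuf[i] = -1; }

  /* Post the receive with MPI_ANY_SOURCE first, then send to self */
  MPI_Irecv(recvbuf, 4, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
  MPI_Send(sendbuf, 4, MPI_INT, rank, 0, MPI_COMM_WORLD);
  MPI_Wait(&req, MPI_STATUS_IGNORE);

  for (i = 0; i < 4; i++) {
    if (recvbuf[i] != sendbuf[i]) {
      nerr++;
      printf("[%d] recvbuf[%d]=%d, expected %d\n", rank, i, recvbuf[i], sendbuf[i]);
    }
  }
  if (!nerr) printf("[%d] OK\n", rank);

  MPI_Finalize();
  return 0;
}

Whether this minimal form actually triggers the bug may depend on the
datatypes and message sizes involved; it only illustrates the communication
pattern being discussed.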
Note in init.c that, by default, PETSc does not use PetscTrMallocDefault()
when valgrind is running, because it doesn't necessarily make sense to put one
memory checker on top of another. So, at a glance, I'm puzzled how it can end
up in the routine PetscTrMallocDefault(). Do you
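As an aside, the general pattern being described here, i.e. skipping a custom
tracing allocator when valgrind is already instrumenting the program, looks
roughly like the sketch below. This is not PETSc's actual init.c code;
install_tracing_malloc is a hypothetical stand-in, and the RUNNING_ON_VALGRIND
client macro comes from valgrind/valgrind.h (valgrind development headers).

#include <stdio.h>
#include <valgrind/valgrind.h>

/* hypothetical hook that would install a tracing malloc/free pair */
static void install_tracing_malloc(void)
{
  printf("installing tracing allocator\n");
}

int main(void)
{
  if (RUNNING_ON_VALGRIND) {
    /* valgrind already watches every allocation; keep the plain allocator */
    printf("running under valgrind, keeping the default allocator\n");
  } else {
    install_tracing_malloc();
  }
  return 0;
}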
Fabian,
I happen to have an Ubuntu virtual machine and I could reproduce the error
with your mini-test, even with two processes. It is horrible to see wrong
results in such a simple test.
We'd better figure out whether it is a PETSc bug or an OpenMPI bug. If it is
the latter, which MPI call is at
Satish,
Can you please add to MPI.py a check for this and simply reject it, telling
the user there are bugs in that version of OpenMPI/Ubuntu?
It is not debuggable, and hence not fixable; it wastes everyone's time and
could even lead to wrong results (which is worse than crashing). We've
We've seen such behavior with Ubuntu's default OpenMPI - but have no
idea why this happens or whether we can work around it.
Last I checked, the same version of OpenMPI, when installed
separately, did not exhibit such issues.
Satish
On Tue, 30 Jul 2019, Fabian.Jakub via petsc-dev wrote:
> Dear
Dear PETSc Team,
Our cluster recently switched to Ubuntu 18.04, which has gcc 7.4 and
Open MPI 2.1.1 - with this I ended up with segfaults and valgrind
errors in DMDAGlobalToNatural.
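For context, the call sequence in which the failures show up is essentially
the following (a minimal C sketch of the same pattern; the actual attached
reproducer is the Fortran file petsc_ex.F90):

#include <petscdmda.h>

int main(int argc, char **argv)
{
  DM             da;
  Vec            global, natural;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  /* small 2D DMDA with one degree of freedom per node */
  ierr = DMDACreate2d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                      DMDA_STENCIL_STAR, 8, 8, PETSC_DECIDE, PETSC_DECIDE,
                      1, 1, NULL, NULL, &da);CHKERRQ(ierr);
  ierr = DMSetFromOptions(da);CHKERRQ(ierr);
  ierr = DMSetUp(da);CHKERRQ(ierr);

  ierr = DMCreateGlobalVector(da, &global);CHKERRQ(ierr);
  ierr = VecSet(global, 1.0);CHKERRQ(ierr);

  /* natural-ordering vector and the global-to-natural scatter */
  ierr = DMDACreateNaturalVector(da, &natural);CHKERRQ(ierr);
  ierr = DMDAGlobalToNaturalBegin(da, global, INSERT_VALUES, natural);CHKERRQ(ierr);
  ierr = DMDAGlobalToNaturalEnd(da, global, INSERT_VALUES, natural);CHKERRQ(ierr);

  ierr = VecDestroy(&natural);CHKERRQ(ierr);
  ierr = VecDestroy(&global);CHKERRQ(ierr);
  ierr = DMDestroy(&da);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}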
This is evident in a minimal Fortran example such as the attached
petsc_ex.F90, which fails with the following error: