Nathan and Eric,

there is a known issue of libnbc not correctly retaining datatypes: if you start a non-blocking collective operation (MPI_Ibcast, for example) and then call MPI_Type_free() on the datatype *before* the non-blocking collective completes, the datatype is freed immediately by MPI_Type_free() and is invalid when it is needed to progress the collective operation.
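
To make the sequence concrete, here is a minimal sketch of the pattern described above (this is not Eric's test case, just an illustration of the scenario):

#include <mpi.h>

int main(int argc, char **argv)
{
    int buf[2] = {0, 0};
    MPI_Datatype pair;
    MPI_Request req;

    MPI_Init(&argc, &argv);

    MPI_Type_contiguous(2, MPI_INT, &pair);
    MPI_Type_commit(&pair);

    MPI_Ibcast(buf, 1, pair, 0, MPI_COMM_WORLD, &req);

    /* Legal per the MPI standard: the implementation must keep the datatype
     * alive until the pending operation completes. With the libnbc issue
     * above, the datatype is released right here and the collective later
     * touches freed memory. */
    MPI_Type_free(&pair);

    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}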


a patch for v2.x is available at https://github.com/ggouaillardet/ompi-release/commit/cd30056efeff7d37257ab2bc0dbffb2e05a6170c.patch

/* note it only fixes MPI_Ibcast() for now, but if this is the root cause of the PETSc crash, I can easily fix the other non-blocking primitives */


/* ideally, this should be done at the ompi level, and not in the libnbc module; that is why https://github.com/open-mpi/ompi/pull/1305 has not been merged yet */
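
For reference, the fix boils down to reference counting: the non-blocking collective takes an extra reference on the datatype when it starts and drops it at completion, so a user-level MPI_Type_free() only marks the datatype for deallocation. Below is a standalone toy sketch of that idea; the names datatype_t, type_create() and type_release() are made up for illustration only, while the actual code uses Open MPI's internal object machinery (OBJ_RETAIN/OBJ_RELEASE):

#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for a reference-counted datatype object. */
typedef struct {
    int refcount;
    const char *name;
} datatype_t;

static datatype_t *type_create(const char *name)
{
    datatype_t *t = malloc(sizeof(*t));
    t->refcount = 1;
    t->name = name;
    return t;
}

/* Models MPI_Type_free(): the caller's handle is invalidated immediately,
 * but the object is only destroyed once the last reference is dropped. */
static void type_release(datatype_t **handle)
{
    datatype_t *t = *handle;
    *handle = NULL;
    if (--t->refcount == 0) {
        printf("datatype '%s' actually destroyed\n", t->name);
        free(t);
    }
}

int main(void)
{
    datatype_t *user_handle = type_create("pair");

    /* Non-blocking collective starts: retain the datatype. */
    datatype_t *in_flight = user_handle;
    in_flight->refcount++;

    /* User frees the datatype before completion: refcount 2 -> 1,
     * the object stays alive for the in-flight operation. */
    type_release(&user_handle);

    /* Collective completes: refcount 1 -> 0, real destruction happens now. */
    type_release(&in_flight);

    return 0;
}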


Cheers,


Gilles


On 7/26/2016 11:46 AM, Nathan Hjelm wrote:
It looks to me like a double free on both the send and receive requests. The receive 
free is an extra OBJ_RELEASE of MPI_DOUBLE, which was not malloced (invalid 
free). The send free is an assert failure in OBJ_RELEASE of an OBJ_NEW() object 
(invalid magic). I plan to look at it in the next couple of days. Let me know 
if you figure it out before I get to it.

-Nathan

On Jul 25, 2016, at 8:38 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Eric,

Where can your test case be downloaded? How many nodes and tasks do you need 
to reproduce the bug?

fwiw, currently there are two Open MPI repositories:
- https://github.com/open-mpi/ompi
  there is only one branch, 'master'; today, this can be seen 
as Open MPI 3.0 pre-alpha
- https://github.com/open-mpi/ompi-release
  the default branch is 'v2.x'; today, this can be seen as Open MPI 2.0.1 pre-
alpha

Cheers,

Gilles

On 7/26/2016 3:33 AM, Eric Chamberland wrote:
Hi,

Has anyone tried OpenMPI 2.0 with PETSc 3.7.2?

I am having some errors with PETSc; maybe someone else has seen them too?

Here are the configure logs for PETSc:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.25.01h16m02s_configure.log

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.25.01h16m02s_RDict.log

And for OpenMPI:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.25.01h16m02s_config.log

(in fact, I am testing the ompi-release branch, a sort of petsc-master analogue, 
since I need commit 9ba6678156).

For a set of parallel tests, 104 pass out of 124 total.

And the typical error:
*** Error in 
`/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.dev':
 free(): invalid pointer:
======= Backtrace: =========
/lib64/libc.so.6(+0x7277f)[0x7f80eb11677f]
/lib64/libc.so.6(+0x78026)[0x7f80eb11c026]
/lib64/libc.so.6(+0x78d53)[0x7f80eb11cd53]
/opt/openmpi-2.x_opt/lib/libopen-pal.so.20(opal_free+0x1f)[0x7f80ea8f9d60]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x16628)[0x7f80df0ea628]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x16c50)[0x7f80df0eac50]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(+0x9f9dd)[0x7f80eb7029dd]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(MPI_Request_free+0xf7)[0x7f80eb702ad6]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adc6d)[0x7f80f2fa6c6d]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f80f2fa1c45]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0xa9d0f5)[0x7f80f35960f5]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(MatDestroy+0x648)[0x7f80f35c2588]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x10bf0f4)[0x7f80f3bb80f4]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x346)[0x7f80f3a796de]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPReset+0x502)[0x7f80f3d19779]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x11707f7)[0x7f80f3c697f7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x346)[0x7f80f3a796de]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPReset+0x502)[0x7f80f3d19779]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x11707f7)[0x7f80f3c697f7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x346)[0x7f80f3a796de]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPReset+0x502)[0x7f80f3d19779]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x11707f7)[0x7f80f3c697f7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x346)[0x7f80f3a796de]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCDestroy+0x5d1)[0x7f80f3a79fd9]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPDestroy+0x7b6)[0x7f80f3d1a334]

a similar one:
*** Error in 
`/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProbFluideIncompressible.dev':
 free(): invalid pointer: 0x00007f382a7c5bc0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7277f)[0x7f3829f1c77f]
/lib64/libc.so.6(+0x78026)[0x7f3829f22026]
/lib64/libc.so.6(+0x78d53)[0x7f3829f22d53]
/opt/openmpi-2.x_opt/lib/libopen-pal.so.20(opal_free+0x1f)[0x7f38296ffd60]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x16628)[0x7f381deab628]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x16c50)[0x7f381deabc50]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(+0x9f9dd)[0x7f382a5089dd]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(MPI_Request_free+0xf7)[0x7f382a508ad6]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adc6d)[0x7f3831dacc6d]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f3831da7c45]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x9f4755)[0x7f38322f3755]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(MatDestroy+0x648)[0x7f38323c8588]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x4e2)[0x7f383287f87a]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCDestroy+0x5d1)[0x7f383287ffd9]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPDestroy+0x7b6)[0x7f3832b20334]

another one:

*** Error in 
`/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.MortierDiffusion.dev':
 free(): invalid pointer: 0x00007f67b6d37bc0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7277f)[0x7f67b648e77f]
/lib64/libc.so.6(+0x78026)[0x7f67b6494026]
/lib64/libc.so.6(+0x78d53)[0x7f67b6494d53]
/opt/openmpi-2.x_opt/lib/libopen-pal.so.20(opal_free+0x1f)[0x7f67b5c71d60]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x1adae)[0x7f67aa4cddae]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x1b4ca)[0x7f67aa4ce4ca]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(+0x9f9dd)[0x7f67b6a7a9dd]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(MPI_Request_free+0xf7)[0x7f67b6a7aad6]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adb09)[0x7f67be31eb09]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f67be319c45]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4574f7)[0x7f67be2c84f7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecDestroy+0x648)[0x7f67be26e8da]

I feel like I should wait until someone else from PETSc has tested it too...

Thanks,

Eric

