On Thu, 30 May 2013 17:19:04 +0200, Jan Blechta <[email protected]> wrote:
> Johannes, now I see that only buildbots having old OpenMPI 1.4.3, as I
I forgot to mention that these are precise-amd64 and precise-i386.

> have, do not run the regression tests in parallel. Is that the reason? Was
>
>     mpirun -n 3 demo_navier-stokes
>
> failing with the PETSc / Hypre BoomerAMG error and deadlocking?
>
> Jan
>
>
> On Thu, 30 May 2013 15:47:12 +0200, Jan Blechta <[email protected]> wrote:
> > I observed that BoomerAMG eventually fails when running on 3, 5, 6 or
> > 7 processes. With 1, 2, 4 or 8 processes it is fine. Strangely enough,
> > nobody has seen it but me, although I can reproduce it very easily:
> >
> >     $ np=3   # or 5, 6, 7
> >     $ export DOLFIN_NOPLOT=1
> >     $ mpirun -n $np demo_navier-stokes
> >
> > with FEniCS 1.0.0 / PETSc 3.2 and with FEniCS dev / PETSc 3.4. After a
> > few timesteps PETSc fails and DOLFIN deadlocks.
> >
> > PETSc throws in this demo when solving the projection step, i.e. a
> > Poisson problem with both Dirichlet and zero Neumann conditions,
> > discretized with piecewise linears on triangles.
> >
> > Regarding the effort to reproduce it with PETSc directly, Jed: I was
> > able to dump this specific matrix to the binary format, but not the
> > vector, so I need to obtain a binary vector somehow - is there
> > documentation of that binary format somewhere?
> >
> > I guess I would need to recompile PETSc in some debug mode to break
> > into Hypre, is that so? This is the backtrace from the process printing
> > the PETSc ERROR:
> > __________________________________________________________________________
> > #0  0x00007ffff5caa2d8 in __GI___poll (fds=0x6d02c0, nfds=6,
> >     timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
> > #1  0x00007fffed0c5ab0 in ?? () from /usr/lib/libopen-pal.so.0
> > #2  0x00007fffed0c48ff in ?? () from /usr/lib/libopen-pal.so.0
> > #3  0x00007fffed0b9221 in opal_progress () from /usr/lib/libopen-pal.so.0
> > #4  0x00007ffff1b593d5 in ??
> >     () from /usr/lib/libmpi.so.0
> > #5  0x00007ffff1b8a1c5 in PMPI_Waitany () from /usr/lib/libmpi.so.0
> > #6  0x00007ffff2f5c43e in VecScatterEnd_1 ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #7  0x00007ffff2f57811 in VecScatterEnd ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #8  0x00007ffff2f3cb9a in VecGhostUpdateEnd ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #9  0x00007ffff74ecdea in dolfin::Assembler::assemble (this=0x7fffffff9da0,
> >     A=..., a=...)
> >     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/fem/Assembler.cpp:96
> > #10 0x00007ffff74e8095 in dolfin::assemble (A=..., a=...)
> >     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/fem/assemble.cpp:38
> > #11 0x0000000000425d41 in main ()
> >     at /usr/users/blechta/fenics/fenics/src/dolfin/demo/pde/navier-stokes/cpp/main.cpp:180
> > _________________________________________________________________________________________
> >
> >
> > This is the backtrace from one deadlocked process:
> > ______________________________________________________________________
> > #0  0x00007ffff5caa2d8 in __GI___poll (fds=0x6d02c0, nfds=6,
> >     timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
> > #1  0x00007fffed0c5ab0 in ?? () from /usr/lib/libopen-pal.so.0
> > #2  0x00007fffed0c48ff in ?? () from /usr/lib/libopen-pal.so.0
> > #3  0x00007fffed0b9221 in opal_progress () from /usr/lib/libopen-pal.so.0
> > #4  0x00007fffdb131a1d in ?? ()
> >     from /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so
> > #5  0x00007fffd9220db9 in ??
> >     () from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
> > #6  0x00007ffff1b6dee9 in PMPI_Allreduce () from /usr/lib/libmpi.so.0
> > #7  0x00007ffff2e7aa74 in PetscSplitOwnership ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #8  0x00007ffff2eee129 in PetscLayoutSetUp ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #9  0x00007ffff2f31cf7 in VecCreate_MPI_Private ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #10 0x00007ffff2f32092 in VecCreate_MPI ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #11 0x00007ffff2f234f7 in VecSetType ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #12 0x00007ffff2f32708 in VecCreate_Standard ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #13 0x00007ffff2f234f7 in VecSetType ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #14 0x00007ffff2fb75a1 in MatGetVecs ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #15 0x00007ffff335fdc6 in PCSetUp_HYPRE ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #16 0x00007ffff3362cd6 in PCSetUp ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #17 0x00007ffff33f676e in KSPSetUp ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #18 0x00007ffff33f7bfe in KSPSolve ()
> >     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #19 0x00007ffff77082f4 in dolfin::PETScKrylovSolver::solve (this=0x9700f0,
> >     x=..., b=...)
> >     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/la/PETScKrylovSolver.cpp:445
> > #20 0x00007ffff7709228 in dolfin::PETScKrylovSolver::solve (this=0x9700f0,
> >     A=..., x=..., b=...)
> >     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/la/PETScKrylovSolver.cpp:491
> > #21 0x00007ffff76d9303 in dolfin::KrylovSolver::solve (this=0x94a8e0,
> >     A=..., x=..., b=...)
> >     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/la/KrylovSolver.cpp:147
> > #22 0x00007ffff76f4b91 in dolfin::LinearSolver::solve (this=0x7fffffff9c40,
> >     A=..., x=..., b=...)
> > _____________________________________________________________________________________
> >
> >
> > On Wed, 29 May 2013 11:19:53 -0500, Jed Brown <[email protected]> wrote:
> > > Jan Blechta <[email protected]> writes:
> > >
> > > > Maybe this is the PETSc stack from the previous time step - this is
> > > > provided by DOLFIN.
> > > >
> > > >> Maybe you aren't checking error codes and try to do something
> > > >> else collective?
> > > >
> > > > I don't know, I'm just using FEniCS.
> > >
> > > When I said "you", I was addressing the list in general, which
> > > includes FEniCS developers.
> > >
> > > >> > [2]PETSC ERROR: PCDestroy() line 121
> > > >> > in /petsc-3.4.0/src/ksp/pc/interface/precon.c
> > > >> > [2]PETSC ERROR: KSPDestroy() line 788
> > > >> > in /petsc-3.4.0/src/ksp/ksp/interface/itfunc.c
> > > >> >
> > > >> > and deadlocks. Have you seen it before? Where could the problem be?
> > > >>
> > > >> The deadlock must be back in your code. This error occurs on
> > > >> PETSC_COMM_SELF, which means we have no way to ensure that the
> > > >> error condition is collective. You can't just go calling other
> > > >> collective functions after such an error.
> > > >
> > > > This means that DOLFIN handles some error condition poorly.
> > >
> > > It appears that way, but that appears to be independent of
> > > whatever causes Hypre to return an error.
> > >
> > > >> Anyway, please set up a reproducible test case and/or get a
> > > >> trace from inside Hypre. It will be useful for them to debug
> > > >> the problem.
> > > >
> > > > I'm not a PETSc user, so it would be quite time-consuming for me to
> > > > try to reproduce it without FEniCS. I will at least try to get a trace.
> > > >
> > > You can try dumping the matrix using '-ksp_view_mat binary' (writes
> > > 'binaryoutput'), for example, then try solving it using a PETSc
> > > example, e.g. src/ksp/ksp/examples/tutorials/ex10.c, with the same
> > > configuration via run-time options.

_______________________________________________
fenics mailing list
[email protected]
http://fenicsproject.org/mailman/listinfo/fenics
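P.S. Spelling out Jed's recipe as I understand it, for the record. This assumes a PETSc source build with PETSC_DIR/PETSC_ARCH set, and that demo_navier-stokes forwards its command line to PETSc (if it does not, the options can go into the PETSC_OPTIONS environment variable); the cg/boomeramg solver names are only an illustration and should match what the demo actually uses:

```shell
# Dump the assembled matrix from the failing run; PETSc writes
# the file 'binaryoutput' into the current directory.
mpirun -n 3 demo_navier-stokes -ksp_view_mat binary

# Re-solve the dumped system with a standalone PETSc example,
# using the same solver configuration via run-time options.
cd $PETSC_DIR/src/ksp/ksp/examples/tutorials
make ex10
mpirun -n 3 ./ex10 -f0 /path/to/binaryoutput \
    -ksp_type cg -pc_type hypre -pc_hypre_type boomeramg
```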
