Re: [petsc-users] MPI Iterative solver crash on HPC

Zhang, Junchao via petsc-users Thu, 17 Jan 2019 19:12:52 -0800

Usually when I meet a SEGV error, I will run it again with a parallel debugger 
like DDT and wait for it to segfault, and then examine the stack trace to see 
what is wrong.
Can you compress your matrix and upload it to google drive, so we can try to 
reproduce the error.
--Junchao Zhang



On Thu, Jan 17, 2019 at 10:44 AM Sal Am via petsc-users 
<petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>> wrote:
I did two runs, one with the SuperLU_dist and one with bcgs using jacobi, 
attached are the results of one of the reports from valgrind on one random 
processor (out of the 128 files).

DS = direct solver
IS = iterative  solver

There is an awful lot of errors.

how I initiated the two runs:
mpiexec valgrind --tool=memcheck -q --num-callers=20 
--log-file=valgrind.log-IS.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type 
jacobi -mattransposematmult_via scalable -build_twosided allreduce -ksp_monitor 
-log_view

mpiexec valgrind --tool=memcheck -q --num-callers=20 
--log-file=valgrind.log-DS.%p ./solveCSys -malloc off -ksp_type gmres -pc_type 
lu -pc_factor_mat_solver_type superlu_dist -ksp_max_it 1 
-ksp_monitor_true_residual -log_view -ksp_error_if_not_converged

Thank you

On Thu, Jan 17, 2019 at 4:24 PM Matthew Knepley 
<knep...@gmail.com<mailto:knep...@gmail.com>> wrote:
On Thu, Jan 17, 2019 at 9:18 AM Sal Am 
<tempoho...@gmail.com<mailto:tempoho...@gmail.com>> wrote:
1) Running out of memory

2) You passed an invalid array
I have select=4:ncpus=32:mpiprocs=32:mem=300GB in the job script, i.e. using 
300GB/node, a total of 1200GB memory, using 4 nodes and 32 processors per node 
(128 processors in total).
I am not sure what would constitute an invalid array or how I can check that. I 
am using the same procedure as when dealing with the smaller matrix. i.e. 
Generate matrix A and vector b using FEM software then convert the matrix and 
vector using a python script ready for petsc. read in petsc and calculate.

Are you running with 64-bit ints here?
Yes I have it configured petsc with  --with-64-bit-indices and debugging mode, 
which this was run on.

It sounds like you have enough memory, but the fact that is runs for smaller 
problems makes me suspicious. It
could still be a memory overwrite. Can you either

a) Run under valgrind

or

b) Run under the debugger and get a stack trace

 ?

   Thanks,

    Matt

On Thu, Jan 17, 2019 at 1:59 PM Matthew Knepley 
<knep...@gmail.com<mailto:knep...@gmail.com>> wrote:
On Thu, Jan 17, 2019 at 8:16 AM Sal Am 
<tempoho...@gmail.com<mailto:tempoho...@gmail.com>> wrote:
SuperLU_dist supports 64-bit ints. Are you not running in parallel?
I will try that, although I think solving the real problem (later on if I can 
get this to work) with 30 million finite elements might be a problem for 
SuperLU_dist. so it is better to get an iterative solver to work with first.

1) Try using    -build_twosided allreduce    on this run
How I ran it: mpiexec valgrind --tool=memcheck -q --num-callers=20 
--log-file=valgrind.log-osaii.%p ./solveCSys -malloc off -ksp_type bcgs 
-pc_type gamg -mattransposematmult_via scalable -build_twosided allreduce 
-ksp_monitor -log_view
I have attached the full error output.

You are getting an SEGV on MatSetValues(), so its either

1) Running out of memory

2) You passed an invalid array

2) Is it possible to get something that fails here but we can run. None of our 
tests show this problem.
I am not how I can do that, but i have added my code which is quite short and 
should only read and solve the system, the problem arises at larger matrices 
for example current test case has 6 million finite elements (~2B non-zero 
numbers and 25M x 25M matrix).

Are you running with 64-bit ints here?

   Matt

On Wed, Jan 16, 2019 at 1:12 PM Matthew Knepley 
<knep...@gmail.com<mailto:knep...@gmail.com>> wrote:
On Wed, Jan 16, 2019 at 3:52 AM Sal Am via petsc-users 
<petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>> wrote:
The memory requested is an insane number. You may need to use 64 bit integers.
Thanks Mark, I reconfigured it to use 64bit, however in the process it says I 
can no longer use MUMPS and SuperLU as they are not supported (I see on MUMPS 
webpage it supports 64int). However it does not exactly solve the problem.

SuperLU_dist supports 64-bit ints. Are you not running in parallel?

This time, it crashes at
[6]PETSC ERROR: #1 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ() line 1989 in 
/lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
ierr      = PetscMalloc1(bi[pn]+1,&bj);
which allocates local portion of B^T*A.
You may also try to increase number of cores to reduce local matrix size.

So I increased the number of cores to 16 on one node and ran it by :
mpiexec valgrind --tool=memcheck -q --num-callers=20 
--log-file=valgrind.log-osa.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type 
gamg -mattransposematmult_via scalable -ksp_monitor -log_view
It crashed after reading in the matrix and before starting to solve. The error:

[15]PETSC ERROR: [0]PETSC ERROR: 
------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch 
system) has told this process to end
[0]PETSC ERROR: [1]PETSC ERROR: 
------------------------------------------------------------------------
[2]PETSC ERROR: 
------------------------------------------------------------------------
[2]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch 
system) has told this process to end
[3]PETSC ERROR: 
------------------------------------------------------------------------
[4]PETSC ERROR: 
------------------------------------------------------------------------
[4]PETSC ERROR: [5]PETSC ERROR: [6]PETSC ERROR: 
------------------------------------------------------------------------
[8]PETSC ERROR: 
------------------------------------------------------------------------
[12]PETSC ERROR: 
------------------------------------------------------------------------
[12]PETSC ERROR: [14]PETSC ERROR: 
------------------------------------------------------------------------
[14]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch 
system) has told this process to end
--------------------------------------------------------------------------
mpiexec noticed that process rank 10 with PID 0 on node r03n01 exited on signal 
9 (Killed).

Now I was running this with valgrind as someone had previously suggested and 
the 16 files created all contain the same type of error:

Okay, its possible that there are bugs in the MPI implementation. So

1) Try using    -build_twosided allreduce    on this run

2) Is it possible to get something that fails here but we can run. None of our 
tests show this problem.

  Thanks,

     Matt

==25940== Invalid read of size 8
==25940==    at 0x5103326: PetscCheckPointer (checkptr.c:81)
==25940==    by 0x4F42058: PetscCommGetNewTag (tagm.c:77)
==25940==    by 0x4FC952D: PetscCommBuildTwoSidedFReq_Ibarrier (mpits.c:373)
==25940==    by 0x4FCB29B: PetscCommBuildTwoSidedFReq (mpits.c:572)
==25940==    by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)
==25940==    by 0x52D6B42: VecAssemblyBegin (vector.c:140)
==25940==    by 0x5328C97: VecLoad_Binary (vecio.c:141)
==25940==    by 0x5329051: VecLoad_Default (vecio.c:516)
==25940==    by 0x52E0BAB: VecLoad (vector.c:933)
==25940==    by 0x4013D5: main (solveCmplxLinearSys.cpp:31)
==25940==  Address 0x19f807fc is 12 bytes inside a block of size 16 alloc'd
==25940==    at 0x4C2A603: memalign (vg_replace_malloc.c:899)
==25940==    by 0x4FD0B0E: PetscMallocAlign (mal.c:41)
==25940==    by 0x4FD23E7: PetscMallocA (mal.c:397)
==25940==    by 0x4FC948E: PetscCommBuildTwoSidedFReq_Ibarrier (mpits.c:371)
==25940==    by 0x4FCB29B: PetscCommBuildTwoSidedFReq (mpits.c:572)
==25940==    by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)
==25940==    by 0x52D6B42: VecAssemblyBegin (vector.c:140)
==25940==    by 0x5328C97: VecLoad_Binary (vecio.c:141)
==25940==    by 0x5329051: VecLoad_Default (vecio.c:516)
==25940==    by 0x52E0BAB: VecLoad (vector.c:933)
==25940==    by 0x4013D5: main (solveCmplxLinearSys.cpp:31)
==25940==


On Mon, Jan 14, 2019 at 7:29 PM Zhang, Hong 
<hzh...@mcs.anl.gov<mailto:hzh...@mcs.anl.gov>> wrote:
Fande:
According to this PR 
https://bitbucket.org/petsc/petsc/pull-requests/1061/a_selinger-feature-faster-scalable/diff

Should we set the scalable algorithm as default?
Sure, we can. But I feel we need do more tests to compare scalable and 
non-scalable algorithms.
On theory, for small to medium matrices, non-scalable matmatmult() algorithm 
enables more efficient
data accessing. Andreas optimized scalable implementation. Our non-scalable 
implementation might have room to be further optimized.
Hong

On Fri, Jan 11, 2019 at 10:34 AM Zhang, Hong via petsc-users 
<petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>> wrote:
Add option '-mattransposematmult_via scalable'
Hong

On Fri, Jan 11, 2019 at 9:52 AM Zhang, Junchao via petsc-users 
<petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>> wrote:
I saw the following error message in your first email.
[0]PETSC ERROR: Out of memory. This could be due to allocating
[0]PETSC ERROR: too large an object or bleeding by not properly
[0]PETSC ERROR: destroying unneeded objects.
Probably the matrix is too large. You can try with more compute nodes, for 
example, use 8 nodes instead of 2, and see what happens.

--Junchao Zhang


On Fri, Jan 11, 2019 at 7:45 AM Sal Am via petsc-users 
<petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>> wrote:
Using a larger problem set with 2B non-zero elements and a matrix of 25M x 25M 
I get the following error:
[4]PETSC ERROR: 
------------------------------------------------------------------------
[4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably 
memory access out of range
[4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[4]PETSC ERROR: or see 
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to 
find memory corruption errors
[4]PETSC ERROR: likely location of problem given in stack below
[4]PETSC ERROR: ---------------------  Stack Frames 
------------------------------------
[4]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[4]PETSC ERROR:       INSTEAD the line number of the start of the function
[4]PETSC ERROR:       is given.
[4]PETSC ERROR: [4] MatCreateSeqAIJWithArrays line 4422 
/lustre/home/vef002/petsc/src/mat/impls/aij/seq/aij.c
[4]PETSC ERROR: [4] MatMatMultSymbolic_SeqAIJ_SeqAIJ line 747 
/lustre/home/vef002/petsc/src/mat/impls/aij/seq/matmatmult.c
[4]PETSC ERROR: [4] MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable line 
1256 /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
[4]PETSC ERROR: [4] MatTransposeMatMult_MPIAIJ_MPIAIJ line 1156 
/lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
[4]PETSC ERROR: [4] MatTransposeMatMult line 9950 
/lustre/home/vef002/petsc/src/mat/interface/matrix.c
[4]PETSC ERROR: [4] PCGAMGCoarsen_AGG line 871 
/lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/agg.c
[4]PETSC ERROR: [4] PCSetUp_GAMG line 428 
/lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/gamg.c
[4]PETSC ERROR: [4] PCSetUp line 894 
/lustre/home/vef002/petsc/src/ksp/pc/interface/precon.c
[4]PETSC ERROR: [4] KSPSetUp line 304 
/lustre/home/vef002/petsc/src/ksp/ksp/interface/itfunc.c
[4]PETSC ERROR: --------------------- Error Message 
--------------------------------------------------------------
[4]PETSC ERROR: Signal received
[4]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[4]PETSC ERROR: Petsc Release Version 3.10.2, unknown
[4]PETSC ERROR: ./solveCSys on a linux-cumulus-debug named r02g03 by vef002 Fri 
Jan 11 09:13:23 2019
[4]PETSC ERROR: Configure options PETSC_ARCH=linux-cumulus-debug 
--with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc 
--with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort 
--with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx 
--download-parmetis --download-metis --download-ptscotch 
--download-superlu_dist --download-mumps --with-scalar-type=complex 
--with-debugging=yes --download-scalapack --download-superlu 
--download-fblaslapack=1 --download-cmake
[4]PETSC ERROR: #1 User provided function() line 0 in  unknown file
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[0]PETSC ERROR: 
------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch 
system) has told this process to end
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see 
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind

Using Valgrind on only one of the valgrind files the following error was 
written:

==9053== Invalid read of size 4
==9053==    at 0x5B8067E: MatCreateSeqAIJWithArrays (aij.c:4445)
==9053==    by 0x5BC2608: MatMatMultSymbolic_SeqAIJ_SeqAIJ (matmatmult.c:790)
==9053==    by 0x5D106F8: MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable 
(mpimatmatmult.c:1337)
==9053==    by 0x5D0E84E: MatTransposeMatMult_MPIAIJ_MPIAIJ 
(mpimatmatmult.c:1186)
==9053==    by 0x5457C57: MatTransposeMatMult (matrix.c:9984)
==9053==    by 0x64DD99D: PCGAMGCoarsen_AGG (agg.c:882)
==9053==    by 0x64C7527: PCSetUp_GAMG (gamg.c:522)
==9053==    by 0x6592AA0: PCSetUp (precon.c:932)
==9053==    by 0x66B1267: KSPSetUp (itfunc.c:391)
==9053==    by 0x4019A2: main (solveCmplxLinearSys.cpp:68)
==9053==  Address 0x8386997f4 is not stack'd, malloc'd or (recently) free'd
==9053==


On Fri, Jan 11, 2019 at 8:41 AM Sal Am 
<tempoho...@gmail.com<mailto:tempoho...@gmail.com>> wrote:
Thank you Dave,

I reconfigured PETSc with valgrind and debugging mode, I ran the code again 
with the following options:
mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20 
--log-file=valgrind.log.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type gamg 
-log_view
(as on the petsc website you linked)

It finished solving using the iterative solver, but the resulting 
valgrind.log.%p files (all 8 corresponding to each processor) are all empty. 
And it took a whooping ~15hours, for what used to take ~10-20min. Maybe this is 
because of valgrind? I am not sure. Attached is the log_view.


On Thu, Jan 10, 2019 at 8:59 AM Dave May 
<dave.mayhe...@gmail.com<mailto:dave.mayhe...@gmail.com>> wrote:


On Thu, 10 Jan 2019 at 08:55, Sal Am via petsc-users 
<petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>> wrote:
I am not sure what is exactly is wrong as the error changes slightly every time 
I run it (without changing the parameters).

This likely implies that you have a memory error in your code (a memory leak 
would not cause this behaviour).
I strongly suggest you make sure your code is free of memory errors.
You can do this using valgrind. See here

https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind

for an explanation of how to use valgrind.

I have attached the first two run's errors and my code.

Is there a memory leak somewhere? I have tried running it with -malloc_dump, 
but not getting anything printed out, however, when run with -log_view I see 
that Viewer is created 4 times, but destroyed 3 times. The way I see it, I have 
destroyed it where I see I no longer have use for it so not sure if I am wrong. 
Could this be the reason why it keeps crashing? It crashes as soon as it reads 
the matrix, before entering the solving mode (I have a print statement before 
solving starts that never prints).

how I run it in the job script on 2 node with 32 processors using the clusters 
OpenMPI.

mpiexec ./solveCSys -ksp_type bcgs -pc_type gamg -ksp_converged_reason 
-ksp_monitor_true_residual -log_view -ksp_error_if_not_converged -ksp_monitor 
-malloc_log -ksp_view

the matrix:
2 122 821 366 (non-zero elements)
25 947 279 x 25 947 279

Thanks and all the best


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/<http://www.cse.buffalo.edu/~knepley/>


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/<http://www.cse.buffalo.edu/~knepley/>


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/<http://www.cse.buffalo.edu/~knepley/>

Re: [petsc-users] MPI Iterative solver crash on HPC

Reply via email to