The memory requested is an insane number. You may need to use 64-bit integers.
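For reference, the requested size looks like integer overflow rather than a genuine allocation: 18446744066024411136 = 2^64 - 7685140480, i.e. a negative 64-bit byte count reinterpreted as unsigned, which is consistent with a 32-bit PetscInt size computation wrapping around. A minimal sketch of reconfiguring with 64-bit indices follows (the arch name is hypothetical and the trailing placeholder stands for your existing options; adapt to your actual configure line):

  # Sketch only: reuse your current options and add --with-64-bit-indices=1
  ./configure PETSC_ARCH=linux-cumulus-debug-64idx --with-64-bit-indices=1 \
      --with-scalar-type=complex --with-debugging=yes \
      <your existing --with-cc/--with-fc/--with-cxx and --download-* options>
  # Then rebuild PETSc and relink the application against the new PETSC_ARCH

Some of the downloaded external packages may need checking for 64-bit index support; configure should report any that are incompatible.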
On Mon, Jan 14, 2019 at 8:06 AM Sal Am via petsc-users <petsc-users@mcs.anl.gov> wrote:

> I ran it by:
> mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log-osa.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type gamg -mattransposematmult_via scalable -ksp_monitor -log_view
>
> The error:
>
> [6]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [6]PETSC ERROR: Out of memory. This could be due to allocating
> [6]PETSC ERROR: too large an object or bleeding by not properly
> [6]PETSC ERROR: destroying unneeded objects.
> [6]PETSC ERROR: Memory allocated 0 Memory used by process 39398023168
> [6]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info.
> [6]PETSC ERROR: Memory requested 18446744066024411136
> [6]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> [6]PETSC ERROR: Petsc Release Version 3.10.2, unknown
> [6]PETSC ERROR: ./solveCSys on a linux-cumulus-debug named r02g03 by vef002 Mon Jan 14 08:54:45 2019
> [6]PETSC ERROR: Configure options PETSC_ARCH=linux-cumulus-debug --with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc --with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort --with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx --download-parmetis --download-metis --download-ptscotch --download-superlu_dist --download-mumps --with-scalar-type=complex --with-debugging=yes --download-scalapack --download-superlu --download-fblaslapack=1 --download-cmake
> [6]PETSC ERROR: #1 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ() line 1989 in /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
> [6]PETSC ERROR: #2 PetscMallocA() line 397 in /lustre/home/vef002/petsc/src/sys/memory/mal.c
> [6]PETSC ERROR: #3 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ() line 1989 in /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
> [6]PETSC ERROR: #4 MatTransposeMatMult_MPIAIJ_MPIAIJ() line 1203 in /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
> [6]PETSC ERROR: #5 MatTransposeMatMult() line 9984 in /lustre/home/vef002/petsc/src/mat/interface/matrix.c
> [6]PETSC ERROR: #6 PCGAMGCoarsen_AGG() line 882 in /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/agg.c
> [6]PETSC ERROR: #7 PCSetUp_GAMG() line 522 in /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/gamg.c
> [6]PETSC ERROR: #8 PCSetUp() line 932 in /lustre/home/vef002/petsc/src/ksp/pc/interface/precon.c
> [6]PETSC ERROR: #9 KSPSetUp() line 391 in /lustre/home/vef002/petsc/src/ksp/ksp/interface/itfunc.c
> [6]PETSC ERROR: #10 main() line 68 in /home/vef002/debugenv/tests/solveCmplxLinearSys.cpp
> [6]PETSC ERROR: PETSc Option Table entries:
> [6]PETSC ERROR: -ksp_monitor
> [6]PETSC ERROR: -ksp_type bcgs
> [6]PETSC ERROR: -log_view
> [6]PETSC ERROR: -malloc off
> [6]PETSC ERROR: -mattransposematmult_via scalable
> [6]PETSC ERROR: -pc_type gamg
> [6]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-ma...@mcs.anl.gov----------
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 6 in communicator MPI_COMM_WORLD
> with errorcode 55.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on exactly when Open MPI kills them.
> --------------------------------------------------------------------------
>
> The memory requested error seems astronomical though... This was done on a machine with 500GB of memory; during my last check it was using 30GB mem/processor, not sure if it increased suddenly. The file size of the matrix is 40GB, still the same matrix:
> 2 122 821 366 (non-zero elements)
> 25 947 279 x 25 947 279
>
> On Fri, Jan 11, 2019 at 5:34 PM Zhang, Hong <hzh...@mcs.anl.gov> wrote:
>
>> Add option '-mattransposematmult_via scalable'
>> Hong
>>
>> On Fri, Jan 11, 2019 at 9:52 AM Zhang, Junchao via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>
>>> I saw the following error message in your first email.
>>>
>>> [0]PETSC ERROR: Out of memory. This could be due to allocating
>>> [0]PETSC ERROR: too large an object or bleeding by not properly
>>> [0]PETSC ERROR: destroying unneeded objects.
>>>
>>> Probably the matrix is too large. You can try with more compute nodes, for example, use 8 nodes instead of 2, and see what happens.
>>>
>>> --Junchao Zhang
>>>
>>> On Fri, Jan 11, 2019 at 7:45 AM Sal Am via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>
>>>> Using a larger problem set with 2B non-zero elements and a matrix of 25M x 25M I get the following error:
>>>> [4]PETSC ERROR: ------------------------------------------------------------------------
>>>> [4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>>> [4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>> [4]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>> [4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>>> [4]PETSC ERROR: likely location of problem given in stack below
>>>> [4]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>>> [4]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>>> [4]PETSC ERROR:       INSTEAD the line number of the start of the function
>>>> [4]PETSC ERROR:       is given.
>>>> [4]PETSC ERROR: [4] MatCreateSeqAIJWithArrays line 4422 /lustre/home/vef002/petsc/src/mat/impls/aij/seq/aij.c
>>>> [4]PETSC ERROR: [4] MatMatMultSymbolic_SeqAIJ_SeqAIJ line 747 /lustre/home/vef002/petsc/src/mat/impls/aij/seq/matmatmult.c
>>>> [4]PETSC ERROR: [4] MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable line 1256 /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>>> [4]PETSC ERROR: [4] MatTransposeMatMult_MPIAIJ_MPIAIJ line 1156 /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>>> [4]PETSC ERROR: [4] MatTransposeMatMult line 9950 /lustre/home/vef002/petsc/src/mat/interface/matrix.c
>>>> [4]PETSC ERROR: [4] PCGAMGCoarsen_AGG line 871 /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/agg.c
>>>> [4]PETSC ERROR: [4] PCSetUp_GAMG line 428 /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/gamg.c
>>>> [4]PETSC ERROR: [4] PCSetUp line 894 /lustre/home/vef002/petsc/src/ksp/pc/interface/precon.c
>>>> [4]PETSC ERROR: [4] KSPSetUp line 304 /lustre/home/vef002/petsc/src/ksp/ksp/interface/itfunc.c
>>>> [4]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>>> [4]PETSC ERROR: Signal received
>>>> [4]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>> [4]PETSC ERROR: Petsc Release Version 3.10.2, unknown
>>>> [4]PETSC ERROR: ./solveCSys on a linux-cumulus-debug named r02g03 by vef002 Fri Jan 11 09:13:23 2019
>>>> [4]PETSC ERROR: Configure options PETSC_ARCH=linux-cumulus-debug --with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc --with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort --with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx --download-parmetis --download-metis --download-ptscotch --download-superlu_dist --download-mumps --with-scalar-type=complex --with-debugging=yes --download-scalapack --download-superlu --download-fblaslapack=1 --download-cmake
>>>> [4]PETSC ERROR: #1 User provided function() line 0 in unknown file
>>>>
>>>> --------------------------------------------------------------------------
>>>> MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
>>>> with errorcode 59.
>>>>
>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>> You may or may not see output from other processes, depending on exactly when Open MPI kills them.
>>>> --------------------------------------------------------------------------
>>>> [0]PETSC ERROR: ------------------------------------------------------------------------
>>>> [0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>>>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>> [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>
>>>> Using Valgrind on only one of the valgrind files, the following error was written:
>>>>
>>>> ==9053== Invalid read of size 4
>>>> ==9053==    at 0x5B8067E: MatCreateSeqAIJWithArrays (aij.c:4445)
>>>> ==9053==    by 0x5BC2608: MatMatMultSymbolic_SeqAIJ_SeqAIJ (matmatmult.c:790)
>>>> ==9053==    by 0x5D106F8: MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable (mpimatmatmult.c:1337)
>>>> ==9053==    by 0x5D0E84E: MatTransposeMatMult_MPIAIJ_MPIAIJ (mpimatmatmult.c:1186)
>>>> ==9053==    by 0x5457C57: MatTransposeMatMult (matrix.c:9984)
>>>> ==9053==    by 0x64DD99D: PCGAMGCoarsen_AGG (agg.c:882)
>>>> ==9053==    by 0x64C7527: PCSetUp_GAMG (gamg.c:522)
>>>> ==9053==    by 0x6592AA0: PCSetUp (precon.c:932)
>>>> ==9053==    by 0x66B1267: KSPSetUp (itfunc.c:391)
>>>> ==9053==    by 0x4019A2: main (solveCmplxLinearSys.cpp:68)
>>>> ==9053== Address 0x8386997f4 is not stack'd, malloc'd or (recently) free'd
>>>> ==9053==
>>>>
>>>> On Fri, Jan 11, 2019 at 8:41 AM Sal Am <tempoho...@gmail.com> wrote:
>>>>
>>>>> Thank you Dave,
>>>>>
>>>>> I reconfigured PETSc with valgrind and debugging mode, and ran the code again with the following options:
>>>>> mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type gamg -log_view
>>>>> (as on the petsc website you linked)
>>>>>
>>>>> It finished solving using the iterative solver, but the resulting valgrind.log.%p files (all 8, one per processor) are all empty. And it took a whopping ~15 hours, for what used to take ~10-20 min. Maybe this is because of valgrind? I am not sure. Attached is the log_view.
>>>>>
>>>>> On Thu, Jan 10, 2019 at 8:59 AM Dave May <dave.mayhe...@gmail.com> wrote:
>>>>>
>>>>>> On Thu, 10 Jan 2019 at 08:55, Sal Am via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>>>
>>>>>>> I am not sure what exactly is wrong, as the error changes slightly every time I run it (without changing the parameters).
>>>>>>
>>>>>> This likely implies that you have a memory error in your code (a memory leak would not cause this behaviour).
>>>>>> I strongly suggest you make sure your code is free of memory errors. You can do this using valgrind. See here
>>>>>>
>>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>>
>>>>>> for an explanation of how to use valgrind.
>>>>>>
>>>>>>> I have attached the first two runs' errors and my code.
>>>>>>>
>>>>>>> Is there a memory leak somewhere? I have tried running it with -malloc_dump, but nothing gets printed out; however, when run with -log_view I see that Viewer is created 4 times but destroyed only 3 times. The way I see it, I have destroyed it wherever I no longer have use for it, so I am not sure if I am wrong. Could this be the reason why it keeps crashing? It crashes as soon as it reads the matrix, before entering the solving mode (I have a print statement before solving starts that never prints).
>>>>>>>
>>>>>>> This is how I run it in the job script on 2 nodes with 32 processors, using the cluster's OpenMPI:
>>>>>>>
>>>>>>> mpiexec ./solveCSys -ksp_type bcgs -pc_type gamg -ksp_converged_reason -ksp_monitor_true_residual -log_view -ksp_error_if_not_converged -ksp_monitor -malloc_log -ksp_view
>>>>>>>
>>>>>>> the matrix:
>>>>>>> 2 122 821 366 (non-zero elements)
>>>>>>> 25 947 279 x 25 947 279
>>>>>>>
>>>>>>> Thanks and all the best