Hi Hong,

According to this PR:
https://bitbucket.org/petsc/petsc/pull-requests/1061/a_selinger-feature-faster-scalable/diff

Should we set the scalable algorithm as the default?

Thanks,

Fande Kong

On Fri, Jan 11, 2019 at 10:34 AM Zhang, Hong via petsc-users <petsc-users@mcs.anl.gov> wrote:

> Add option '-mattransposematmult_via scalable'
> Hong
>
> On Fri, Jan 11, 2019 at 9:52 AM Zhang, Junchao via petsc-users <petsc-users@mcs.anl.gov> wrote:
>
>> I saw the following error message in your first email.
>>
>> [0]PETSC ERROR: Out of memory. This could be due to allocating
>> [0]PETSC ERROR: too large an object or bleeding by not properly
>> [0]PETSC ERROR: destroying unneeded objects.
>>
>> Probably the matrix is too large. You can try with more compute nodes,
>> for example, use 8 nodes instead of 2, and see what happens.
>>
>> --Junchao Zhang
>>
>> On Fri, Jan 11, 2019 at 7:45 AM Sal Am via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>
>>> Using a larger problem set with 2B non-zero elements and a matrix of
>>> 25M x 25M I get the following error:
>>>
>>> [4]PETSC ERROR: ------------------------------------------------------------------------
>>> [4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>> [4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> [4]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> [4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>> [4]PETSC ERROR: likely location of problem given in stack below
>>> [4]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>> [4]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>> [4]PETSC ERROR:       INSTEAD the line number of the start of the function
>>> [4]PETSC ERROR:       is given.
>>> [4]PETSC ERROR: [4] MatCreateSeqAIJWithArrays line 4422 /lustre/home/vef002/petsc/src/mat/impls/aij/seq/aij.c
>>> [4]PETSC ERROR: [4] MatMatMultSymbolic_SeqAIJ_SeqAIJ line 747 /lustre/home/vef002/petsc/src/mat/impls/aij/seq/matmatmult.c
>>> [4]PETSC ERROR: [4] MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable line 1256 /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>> [4]PETSC ERROR: [4] MatTransposeMatMult_MPIAIJ_MPIAIJ line 1156 /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>> [4]PETSC ERROR: [4] MatTransposeMatMult line 9950 /lustre/home/vef002/petsc/src/mat/interface/matrix.c
>>> [4]PETSC ERROR: [4] PCGAMGCoarsen_AGG line 871 /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/agg.c
>>> [4]PETSC ERROR: [4] PCSetUp_GAMG line 428 /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/gamg.c
>>> [4]PETSC ERROR: [4] PCSetUp line 894 /lustre/home/vef002/petsc/src/ksp/pc/interface/precon.c
>>> [4]PETSC ERROR: [4] KSPSetUp line 304 /lustre/home/vef002/petsc/src/ksp/ksp/interface/itfunc.c
>>> [4]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>> [4]PETSC ERROR: Signal received
>>> [4]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>> [4]PETSC ERROR: Petsc Release Version 3.10.2, unknown
>>> [4]PETSC ERROR: ./solveCSys on a linux-cumulus-debug named r02g03 by vef002 Fri Jan 11 09:13:23 2019
>>> [4]PETSC ERROR: Configure options PETSC_ARCH=linux-cumulus-debug --with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc --with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort --with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx --download-parmetis --download-metis --download-ptscotch --download-superlu_dist --download-mumps --with-scalar-type=complex --with-debugging=yes --download-scalapack --download-superlu --download-fblaslapack=1 --download-cmake
>>> [4]PETSC ERROR: #1 User provided function() line 0 in unknown file
>>>
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
>>> with errorcode 59.
>>>
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>>
>>> [0]PETSC ERROR: ------------------------------------------------------------------------
>>> [0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>
>>> Running Valgrind, the following error was written to only one of the valgrind files:
>>>
>>> ==9053== Invalid read of size 4
>>> ==9053==    at 0x5B8067E: MatCreateSeqAIJWithArrays (aij.c:4445)
>>> ==9053==    by 0x5BC2608: MatMatMultSymbolic_SeqAIJ_SeqAIJ (matmatmult.c:790)
>>> ==9053==    by 0x5D106F8: MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable (mpimatmatmult.c:1337)
>>> ==9053==    by 0x5D0E84E: MatTransposeMatMult_MPIAIJ_MPIAIJ (mpimatmatmult.c:1186)
>>> ==9053==    by 0x5457C57: MatTransposeMatMult (matrix.c:9984)
>>> ==9053==    by 0x64DD99D: PCGAMGCoarsen_AGG (agg.c:882)
>>> ==9053==    by 0x64C7527: PCSetUp_GAMG (gamg.c:522)
>>> ==9053==    by 0x6592AA0: PCSetUp (precon.c:932)
>>> ==9053==    by 0x66B1267: KSPSetUp (itfunc.c:391)
>>> ==9053==    by 0x4019A2: main (solveCmplxLinearSys.cpp:68)
>>> ==9053== Address 0x8386997f4 is not stack'd, malloc'd or (recently) free'd
>>>
>>> On Fri, Jan 11, 2019 at 8:41 AM Sal Am <tempoho...@gmail.com> wrote:
>>>
>>>> Thank you Dave,
>>>>
>>>> I reconfigured PETSc with valgrind and debugging mode, and ran the code
>>>> again with the following options:
>>>> mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type gamg -log_view
>>>> (as on the PETSc website you linked)
>>>>
>>>> It finished solving using the iterative solver, but the resulting
>>>> valgrind.log.%p files (all 8, corresponding to each process) are all empty.
>>>> And it took a whopping ~15 hours, for what used to take ~10-20 min.
>>>> Maybe this is because of valgrind? I am not sure. Attached is the log_view.
>>>>
>>>> On Thu, Jan 10, 2019 at 8:59 AM Dave May <dave.mayhe...@gmail.com> wrote:
>>>>
>>>>> On Thu, 10 Jan 2019 at 08:55, Sal Am via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>>
>>>>>> I am not sure what exactly is wrong, as the error changes slightly
>>>>>> every time I run it (without changing the parameters).
>>>>>
>>>>> This likely implies that you have a memory error in your code (a
>>>>> memory leak would not cause this behaviour).
>>>>> I strongly suggest you make sure your code is free of memory errors.
>>>>> You can do this using valgrind. See here
>>>>>
>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>
>>>>> for an explanation of how to use valgrind.
>>>>>
>>>>>> I have attached the first two runs' errors and my code.
>>>>>>
>>>>>> Is there a memory leak somewhere? I have tried running it with
>>>>>> -malloc_dump, but nothing gets printed out; however, when run with
>>>>>> -log_view I see that Viewer is created 4 times but destroyed only 3 times.
>>>>>> As far as I can tell, I have destroyed each viewer where I no longer had
>>>>>> use for it, so I am not sure if I am wrong. Could this be the reason why
>>>>>> it keeps crashing? It crashes as soon as it reads the matrix, before
>>>>>> entering the solve (I have a print statement before solving starts that
>>>>>> never prints).
>>>>>>
>>>>>> This is how I run it in the job script, on 2 nodes with 32 processors,
>>>>>> using the cluster's OpenMPI:
>>>>>> mpiexec ./solveCSys -ksp_type bcgs -pc_type gamg -ksp_converged_reason -ksp_monitor_true_residual -log_view -ksp_error_if_not_converged -ksp_monitor -malloc_log -ksp_view
>>>>>>
>>>>>> The matrix:
>>>>>> 2 122 821 366 (non-zero elements)
>>>>>> 25 947 279 x 25 947 279
>>>>>>
>>>>>> Thanks and all the best
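For anyone hitting a similar failure, a quick back-of-the-envelope check shows why this matrix is near the edge of what a couple of nodes can hold. The sketch below assumes plain AIJ storage with complex scalars (16 bytes per value, matching the --with-scalar-type=complex configure line above) and default 32-bit PetscInt column indices (4 bytes each); a 64-bit-index build would cost 24 bytes per nonzero instead.

```shell
# Rough AIJ storage estimate for the matrix in this thread:
# 2,122,821,366 nonzeros, each costing 16 B (complex value) + 4 B (column index).
# Row offsets, (25,947,279 + 1) * 4 B ~ 0.1 GiB, are negligible by comparison.
nnz=2122821366
awk -v n="$nnz" 'BEGIN { printf "%.1f GiB\n", n * 20 / 2^30 }'
# prints: 39.5 GiB
```

That is roughly 40 GiB for a single copy of the assembled operator, before any of the intermediate products that the non-scalable MatTransposeMatMult path builds during PCGAMGCoarsen_AGG, so Hong's '-mattransposematmult_via scalable' suggestion and Junchao's advice to spread the problem over more nodes both attack the same memory bottleneck.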