Have you run it yet with valgrind? It could be memory corruption earlier in the run that causes the later crash; crashes that occur at different places for the same run are almost always due to memory corruption.

If valgrind is clean, you can run with -on_error_attach_debugger; if X forwarding is set up, it will open a debugger on the crashing process, and you can type bt to see exactly where it is crashing, at what line number and code line.

Barry
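For concreteness, the usual invocations look roughly like the lines below; the launcher, rank count, executable, options, and X display are placeholders to adapt to your setup, and %p in the valgrind log name expands to the process id so each rank writes its own log:

mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.%p.log ./your_app -your_options
mpiexec -n 8 ./your_app -your_options -on_error_attach_debugger gdb -display your_workstation:0.0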
> On Oct 29, 2020, at 1:04 AM, Marius Buerkle <[email protected]> wrote:
>
> Hi Sherry,
>
> I used only 1 OpenMP thread and I also recompiled PETSc in debug mode with OpenMP turned off. But it did not help.
>
> Here is the output I can get from SuperLU during the PETSc run:
> Nonzeros in L 29519630
> Nonzeros in U 29519630
> nonzeros in L+U 58996711
> nonzeros in LSUB 4509612
> ** Memory Usage **********************************
> ** NUMfact space (MB): (sum-of-all-processes)
> L\U : 952.18 | Total : 1980.60
> ** Total highmark (MB):
> Sum-of-all : 12401.85 | Avg : 387.56 | Max : 387.56
> **************************************************
> **************************************************
> **** Time (seconds) ****
> EQUIL time 0.06
> ROWPERM time 1.03
> COLPERM time 1.01
> SYMBFACT time 0.45
> DISTRIBUTE time 0.33
> FACTOR time 0.90
> Factor flops 2.225916e+11 Mflops 247438.62
> SOLVE time 0.000
> **************************************************
>
> I tried all available ordering options for ColPerm (NATURAL, MMD_AT_PLUS_A, MMD_ATA, METIS_AT_PLUS_A), save for parmetis, which always crashes. For RowPerm I used NOROWPERM and LargeDiag_MC64. All give the same seg fault.
>
>
> Sent: Thursday, October 29, 2020 at 14:14
> From: "Xiaoye S. Li" <[email protected]>
> To: "Marius Buerkle" <[email protected]>
> Cc: "Zhang, Hong" <[email protected]>, "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
> Subject: Re: Re: Re: [petsc-users] superlu_dist segfault
> Hong: thanks for the diagnosis!
>
> Marius: how many OpenMP threads are you using per MPI task?
> In an earlier email, you mentioned the allocation failure at the following line:
> if ( !(lsum = (doublecomplex*) SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");
>
> This is in the solve phase. I think when we did some OpenMP optimization, we allowed several data structures to grow with the number of OpenMP threads. You can try to use 1 thread.
>
> The RHS and X memories are easy to compute. However, in order to gauge how much memory is used in the factorization, can you print out the number of nonzeros in the L and U factors? What ordering option are you using? The sparse matrix A looks pretty small.
>
> The code can also print out the working storage used during factorization. I am not sure how this printing can be turned on through PETSc.
>
> Sherry
>
> On Wed, Oct 28, 2020 at 9:43 PM Marius Buerkle <[email protected]> wrote:
> Thanks for the swift reply.
>
> I also realized that if I reduce the number of RHS then it works. But I am running the code on a cluster with 256 GB RAM / node. One dense matrix would be around ~30 GB, so 60 GB in total, which is large but does not exceed the memory of even one node, and I also get the seg fault if I run it on several nodes. Moreover, it works well with the MUMPS and MKL_CPARDISO solvers. The maximum memory used with MUMPS is around 150 GB during the solve phase, but SuperLU_dist crashes even before reaching the solve phase. Could there be such a large difference in memory usage between SuperLU_dist and MUMPS?
>
>
> best,
>
> marius
>
>
> Sent: Thursday, October 29, 2020 at 10:10
> From: "Zhang, Hong" <[email protected]>
> To: "Marius Buerkle" <[email protected]>
> Cc: "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
> Subject: Re: Re: [petsc-users] superlu_dist segfault
> Marius,
> I tested your code with petsc-release on my Mac laptop using np=2 cores. I first tested a small matrix data file successfully. Then I switched to your data file and ran out of memory, likely due to the dense matrices B and X. I got the error "Your system has run out of application memory" from my laptop.
>
> The sparse matrix A has size 42549 by 42549. Your code creates dense matrices B and X of the same size -- a huge memory requirement!
> By replacing B and X with size 42549 by nrhs (nrhs <= 4000), I had the code run well with np=2. Note the error message you got:
> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>
> The modified code I used is attached.
> Hong
>
> From: Marius Buerkle <[email protected]>
> Sent: Tuesday, October 27, 2020 10:01 PM
> To: Zhang, Hong <[email protected]>
> Cc: [email protected] <[email protected]>; Sherry Li <[email protected]>
> Subject: Aw: Re: [petsc-users] superlu_dist segfault
>
> Hi,
>
> I recompiled PETSc with the debug option; now I get a seg fault at a different position:
>
> [23]PETSC ERROR: ------------------------------------------------------------------------
> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [23]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [23]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [23]PETSC ERROR: or try http://valgrind.org on GNU/Linux and Apple Mac OS X to find memory corruption errors
> [23]PETSC ERROR: likely location of problem given in stack below
> [23]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> [23]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [23]PETSC ERROR: INSTEAD the line number of the start of the function
> [23]PETSC ERROR: is given.
> [23]PETSC ERROR: [23] SuperLU_DIST:pzgssvx line 242 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [23]PETSC ERROR: [23] MatMatSolve_SuperLU_DIST line 211 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [23]PETSC ERROR: [23] MatMatSolve line 3466 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/interface/matrix.c
> [23]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [23]PETSC ERROR: Signal received
>
> I made a small reproducer. The matrix is a bit too big, so I cannot attach it directly to the email, but I put it in the cloud:
> https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw
>
> Best,
> Marius
>
>
> Sent: Tuesday, October 27, 2020 at 23:11
> From: "Zhang, Hong" <[email protected]>
> To: "Marius Buerkle" <[email protected]>, "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
> Subject: Re: [petsc-users] superlu_dist segfault
> Marius,
> It fails at line 1075 in the file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c
> if ( !(lsum = (doublecomplex*)SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");
>
> We do not know what it means. You may use a debugger to check the values of the variables involved.
> I'm cc'ing Sherry (the superlu_dist developer), or you may send us a stand-alone short code that reproduces the error. We can help with its investigation.
> Hong
>
>
> From: petsc-users <[email protected]> on behalf of Marius Buerkle <[email protected]>
> Sent: Tuesday, October 27, 2020 8:46 AM
> To: [email protected] <[email protected]>
> Subject: [petsc-users] superlu_dist segfault
>
> Hi,
>
> When using MatMatSolve with superlu_dist I get a segmentation fault:
>
> Malloc fails for lsum[]. at line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c
>
> The matrix size is not particularly big. I am using the petsc release branch, and superlu_dist is v6.3.0, I think.
>
> Best,
> Marius
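In case it is useful, a rough, untested sketch of the shape Hong describes above -- dense B and X created with only nrhs columns instead of the full matrix dimension, a SuperLU_DIST LU factorization, then one MatMatSolve -- could look as follows. The file name, nrhs value, and natural ordering are placeholders, and this is not the code that was actually attached to Hong's mail:

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, F, B, X;
  IS             rowperm, colperm;
  MatFactorInfo  info;
  PetscViewer    viewer;
  PetscInt       N, nlocal, nrhs = 100;   /* placeholder: number of right-hand sides */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Load the sparse matrix A from a PETSc binary file (placeholder name) */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.dat", FILE_MODE_READ, &viewer); CHKERRQ(ierr);
  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  ierr = MatSetFromOptions(A); CHKERRQ(ierr);
  ierr = MatLoad(A, viewer); CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);
  ierr = MatGetSize(A, &N, NULL); CHKERRQ(ierr);
  ierr = MatGetLocalSize(A, &nlocal, NULL); CHKERRQ(ierr);

  /* B and X are N x nrhs, not N x N, and share A's row layout */
  ierr = MatCreateDense(PETSC_COMM_WORLD, nlocal, PETSC_DECIDE, N, nrhs, NULL, &B); CHKERRQ(ierr);
  ierr = MatCreateDense(PETSC_COMM_WORLD, nlocal, PETSC_DECIDE, N, nrhs, NULL, &X); CHKERRQ(ierr);
  ierr = MatSetRandom(B, NULL); CHKERRQ(ierr);   /* placeholder right-hand sides */

  /* LU factorization with SuperLU_DIST, then one multi-RHS solve */
  ierr = MatGetFactor(A, MATSOLVERSUPERLU_DIST, MAT_FACTOR_LU, &F); CHKERRQ(ierr);
  ierr = MatGetOrdering(A, MATORDERINGNATURAL, &rowperm, &colperm); CHKERRQ(ierr);
  ierr = MatFactorInfoInitialize(&info); CHKERRQ(ierr);
  ierr = MatLUFactorSymbolic(F, A, rowperm, colperm, &info); CHKERRQ(ierr);
  ierr = MatLUFactorNumeric(F, A, &info); CHKERRQ(ierr);
  ierr = MatMatSolve(F, B, X); CHKERRQ(ierr);

  ierr = ISDestroy(&rowperm); CHKERRQ(ierr);
  ierr = ISDestroy(&colperm); CHKERRQ(ierr);
  ierr = MatDestroy(&X); CHKERRQ(ierr);
  ierr = MatDestroy(&B); CHKERRQ(ierr);
  ierr = MatDestroy(&F); CHKERRQ(ierr);
  ierr = MatDestroy(&A); CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

With this layout each dense matrix costs on the order of N*nrhs scalars per run instead of N*N, which is the difference between tens of megabytes and tens of gigabytes for a 42549-by-42549 system.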
