Hi,
I cannot use -on_error_attach_debugger, as X forwarding does not work on the system. Is it possible to dump the gdb output to a file instead?
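(For reference, a sketch of one way to capture per-process diagnostics to files when X forwarding is unavailable, using valgrind's --log-file option; the executable name and rank count are placeholders:)

mpiexec -n 4 valgrind --track-origins=yes --log-file=valgrind.%p.log ./superlu_test
# %p expands to each process's PID, so every MPI rank writes its own log file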
I ran it through valgrind. There seems to be some problem during the calls into superlu_dist, but I don't know whether this eventually causes the segfault. I think this is the relevant output:
==43569== Conditional jump or move depends on uninitialised value(s)
==43569== at 0x1473C515: pzgstrs (pzgstrs.c:1074)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)
==43569==
==43569== Use of uninitialised value of size 8
==43569== at 0x1473C554: pzgstrs (pzgstrs.c:1077)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)
==43569==
==43569== Use of uninitialised value of size 8
==43569== at 0x1473C55A: pzgstrs (pzgstrs.c:1077)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)
==43569==
==43569== Invalid write of size 8
==43569== at 0x1473C554: pzgstrs (pzgstrs.c:1077)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)
==43569== Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd
==43569== at 0x4C2D814: memalign (vg_replace_malloc.c:906)
==43569== by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)
==43569== by 0x1464D488: superlu_malloc_dist (memory.c:127)
==43569== by 0x1473C451: pzgstrs (pzgstrs.c:1044)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)
==43569==
==43569== Invalid write of size 8
==43569== at 0x1473C55A: pzgstrs (pzgstrs.c:1077)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)
==43569== Address 0x266e5ad0 is 16 bytes after a block of size 35,520 alloc'd
==43569== at 0x4C2D814: memalign (vg_replace_malloc.c:906)
==43569== by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)
==43569== by 0x1464D488: superlu_malloc_dist (memory.c:127)
==43569== by 0x1473C451: pzgstrs (pzgstrs.c:1044)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)
==43569==
I have also attached the whole log. Does this make any sense? The problem seems to be right around where I get the original segfault.
best,
marius
Sent: Saturday, October 31, 2020 at 04:07
From: "Barry Smith" <[email protected]>
To: "Marius Buerkle" <[email protected]>
Cc: "Xiaoye S. Li" <[email protected]>, "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
Subject: Re: [petsc-users] superlu_dist segfault
If valgrind is clean, you can run with -on_error_attach_debugger. If the X forwarding is set up, it will open a debugger on the crashing process, and you can type bt to see exactly where it is crashing, at what line number and code line.
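(For reference, a sketch of how such a session might look; the executable name and rank count are placeholders:)

mpiexec -n 8 ./superlu_test -on_error_attach_debugger gdb
# an xterm with gdb attached to the crashing rank opens; then at the prompt:
(gdb) bt    # prints the backtrace with source files and line numbers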
Barry
On Oct 29, 2020, at 1:04 AM, Marius Buerkle <[email protected]> wrote:

Hi Sherry,

I used only 1 OpenMP thread, and I also recompiled PETSC in debug mode with OpenMP turned off, but it did not help. Here is the output I can get from SuperLU during the PETSC run:

Nonzeros in L 29519630
Nonzeros in U 29519630
nonzeros in L+U 58996711
nonzeros in LSUB 4509612

** Memory Usage **********************************
** NUMfact space (MB): (sum-of-all-processes)
L\U : 952.18 | Total : 1980.60
** Total highmark (MB):
Sum-of-all : 12401.85 | Avg : 387.56 | Max : 387.56
**************************************************
**************************************************
**** Time (seconds) ****
EQUIL time 0.06
ROWPERM time 1.03
COLPERM time 1.01
SYMBFACT time 0.45
DISTRIBUTE time 0.33
FACTOR time 0.90
Factor flops 2.225916e+11 Mflops 247438.62
SOLVE time 0.000
**************************************************

I tried all available ordering options for Colperm (NATURAL, MMD_AT_PLUS_A, MMD_ATA, METIS_AT_PLUS_A), save for parmetis, which always crashes. For Rowperm I used NOROWPERM and LargeDiag_MC64. All give the same segfault.
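(For reference, these orderings can also be selected at runtime through the PETSc options database; a sketch with a placeholder executable name:)

./superlu_test -mat_superlu_dist_colperm METIS_AT_PLUS_A -mat_superlu_dist_rowperm NOROWPERM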
Von: "Xiaoye S. Li" <[email protected]>
An: "Marius Buerkle" <[email protected]>
Cc: "Zhang, Hong" <[email protected]>, "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
Betreff: Re: Re: Re: [petsc-users] superlu_dist segfaultHong: thanks for the diagnosis!Marius: how many OpenMP threads are you using per MPI task?In an earlier email, you mentioned the allocation failure at the following line:if ( !(lsum = (doublecomplex*) SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");this is in the solve phase. I think when we do some OpenMP optimization, we allowed several data structures to grow with OpenMP threads. You can try to use 1 thread.
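(For reference, one way to force a single thread, assuming the OpenMP runtime honors the standard environment variable:)

export OMP_NUM_THREADS=1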
The RHS and X memories are easy to compute. However, in order to gauge how much memory is used in the factorization, can you print out the number of nonzeros in the L and U factors? What ordering option are you using? The sparse matrix A looks pretty small.

The code can also print out the working storage used during factorization. I am not sure how this printing can be turned on through PETSc.

Sherry
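(On the PETSc side, the runtime option below should turn on SuperLU_DIST's statistics printing; treat this as an assumption to verify against the MATSOLVERSUPERLU_DIST manual page:)

./superlu_test -mat_superlu_dist_statprint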
On Wed, Oct 28, 2020 at 9:43 PM Marius Buerkle <[email protected]> wrote:

Thanks for the swift reply.

I also realized that if I reduce the number of RHS it works. But I am running the code on a cluster with 256 GB RAM per node. One dense matrix would be around ~30 GB, so 60 GB for both, which is large but does not exceed the memory of even one node, and I also get the segfault if I run it on several nodes. Moreover, it works well with the MUMPS and MKL_CPARDISO solvers. The maximum memory used with MUMPS is around 150 GB during the solve phase, but SuperLU_DIST crashes even before reaching the solve phase. Could there be such a large difference in memory usage between SuperLU_DIST and MUMPS?
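(For scale, a back-of-the-envelope check, assuming double-complex entries of 16 bytes each: a dense 42549 x 42549 matrix takes 42549^2 * 16 bytes, roughly 29 GB, consistent with the ~30 GB figure above.)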
best,
marius
Sent: Thursday, October 29, 2020 at 10:10
From: "Zhang, Hong" <[email protected]>
To: "Marius Buerkle" <[email protected]>
Cc: "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
Subject: Re: Re: [petsc-users] superlu_dist segfault

Marius,

I tested your code with petsc-release on my Mac laptop using np=2 cores. I first tested a small matrix data file successfully. Then I switched to your data file and ran out of memory, likely due to the dense matrices B and X; I got the error "Your system has run out of application memory" from my laptop.

The sparse matrix A has size 42549 by 42549. Your code creates dense matrices B and X of the same size -- a huge memory requirement!

By replacing B and X with size 42549 by nrhs (nrhs <= 4000), I had the code run well with np=2. Note the error message you got:

[23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range

The modified code I used is attached.

Hong
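(A minimal sketch of the change Hong describes, assuming the standard PETSc API; n and nrhs stand in for the matrix size and the number of right-hand sides:)

#include <petscmat.h>

Mat      B, X;
PetscInt n = 42549, nrhs = 4000;
/* allocate tall-skinny dense matrices of size n x nrhs instead of n x n */
MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, nrhs, NULL, &B);
MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, nrhs, NULL, &X);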
From: Marius Buerkle <[email protected]>
Sent: Tuesday, October 27, 2020 10:01 PM
To: Zhang, Hong <[email protected]>
Cc: [email protected] <[email protected]>; Sherry Li <[email protected]>
Subject: Aw: Re: [petsc-users] superlu_dist segfault

Hi,

I recompiled PETSC with the debug option; now I get a segfault at a different position:

[23]PETSC ERROR: ------------------------------------------------------------------------
[23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[23]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[23]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[23]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[23]PETSC ERROR: likely location of problem given in stack below
[23]PETSC ERROR: --------------------- Stack Frames ------------------------------------
[23]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[23]PETSC ERROR: INSTEAD the line number of the start of the function
[23]PETSC ERROR: is given.
[23]PETSC ERROR: [23] SuperLU_DIST:pzgssvx line 242 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[23]PETSC ERROR: [23] MatMatSolve_SuperLU_DIST line 211 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[23]PETSC ERROR: [23] MatMatSolve line 3466 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/interface/matrix.c
[23]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[23]PETSC ERROR: Signal received

I made a small reproducer. The matrix is a bit too big, so I cannot attach it directly to the email, but I put it in the cloud.

Best,
Marius

Sent: Tuesday, October 27, 2020 at 23:11
From: "Zhang, Hong" <[email protected]>
To: "Marius Buerkle" <[email protected]>, "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
Subject: Re: [petsc-users] superlu_dist segfault

Marius,

It fails at line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c:

if ( !(lsum = (doublecomplex*)SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");

We do not know what it means. You may use a debugger to check the values of the variables involved. I'm cc'ing Sherry (the superlu_dist developer), or you may send us a stand-alone short code that reproduces the error. We can help with its investigation.

Hong
From: petsc-users <[email protected]> on behalf of Marius Buerkle <[email protected]>
Sent: Tuesday, October 27, 2020 8:46 AM
To: [email protected] <[email protected]>
Subject: [petsc-users] superlu_dist segfault

Hi,

When using MatMatSolve with superlu_dist I get a segmentation fault:

Malloc fails for lsum[]. at line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c

The matrix size is not particularly big, and I am using the petsc release branch; superlu_dist is v6.3.0, I think.

Best,
Marius
Attachment: valgrind.tar.gz