Sorry, it is also included in the archive on OneDrive; I should have mentioned that. It is the same code and data that I sent at the beginning. I don't think I changed anything.
Sent: Tuesday, November 3, 2020 at 00:45
From: "Barry Smith" <[email protected]>
To: "Marius Buerkle" <[email protected]>
Cc: "Stefano Zampini" <[email protected]>, "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
Subject: Re: [petsc-users] superlu_dist segfault
On Nov 2, 2020, at 9:27 AM, Marius Buerkle <[email protected]> wrote:

The matrix is a bit too big for an email attachment, so I put it on OneDrive.

Sent: Monday, November 2, 2020 at 23:58
From: "Barry Smith" <[email protected]>
To: "Marius Buerkle" <[email protected]>
Cc: "Stefano Zampini" <[email protected]>, "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
Subject: Re: [petsc-users] superlu_dist segfault

Please send this program and your data file. This should definitely not be happening.

Barry

Valgrind is generally trustworthy.

On Nov 2, 2020, at 12:21 AM, Marius Buerkle <[email protected]> wrote:

<valgrind_track-origins.tar.gz>

Hi,

I tried valgrind with track-origins; valgrind crashes at some point due to running out of energy, though. But before that I get a lot of "Conditional jump or move depends on uninitialised value(s)" and "Use of uninitialised value of size 8" messages. Not all of them are related to PETSc, but some of them occur during MatLoad, PCSetup_LU, and also in SuperLU. For example:

==41867== Conditional jump or move depends on uninitialised value(s)
==41867== at 0x5DEA7C4: MatSetValues_MPIAIJ (mpiaij.c:601)
==41867== by 0x5E310D8: MatMPIAIJSetPreallocationCSR_MPIAIJ (mpiaij.c:4031)
==41867== by 0x5E31773: MatMPIAIJSetPreallocationCSR (mpiaij.c:4091)
==41867== by 0x5E2184C: MatLoad_MPIAIJ_Binary (mpiaij.c:3197)
==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)
==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)
==41867== by 0x4063ED: main (superlu_test.c:28)
==41867== Uninitialised value was created by a heap allocation
==41867== at 0x4C2D814: memalign (vg_replace_malloc.c:906)
==41867== by 0x50220D6: PetscMallocAlign (mal.c:52)
==41867== by 0x50242D4: PetscMallocA (mal.c:425)
==41867== by 0x5E20FC2: MatLoad_MPIAIJ_Binary (mpiaij.c:3187)
==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)
==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)
==41867== by 0x4063ED: main (superlu_test.c:28)
==41867==
==41867== Use of uninitialised value of size 8
==41867== at 0x5DEA8AE: MatSetValues_MPIAIJ (mpiaij.c:603)
==41867== by 0x5E310D8: MatMPIAIJSetPreallocationCSR_MPIAIJ (mpiaij.c:4031)
==41867== by 0x5E31773: MatMPIAIJSetPreallocationCSR (mpiaij.c:4091)
==41867== by 0x5E2184C: MatLoad_MPIAIJ_Binary (mpiaij.c:3197)
==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)
==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)
==41867== by 0x4063ED: main (superlu_test.c:28)
==41867== Uninitialised value was created by a heap allocation
==41867== at 0x4C2D814: memalign (vg_replace_malloc.c:906)
==41867== by 0x50220D6: PetscMallocAlign (mal.c:52)
==41867== by 0x50242D4: PetscMallocA (mal.c:425)
==41867== by 0x5E20FC2: MatLoad_MPIAIJ_Binary (mpiaij.c:3187)
==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)
==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)
==41867== by 0x4063ED: main (superlu_test.c:28)

I don't know if these are real errors or only some problem with valgrind. I attached the whole valgrind logs; they are rather noisy, though.

Best,
Marius

Sent: Sunday, November 1, 2020 at 19:09
From: "Stefano Zampini" <[email protected]>
To: "Barry Smith" <[email protected]>
Cc: "Marius Buerkle" <[email protected]>, "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
Subject: Re: [petsc-users] superlu_dist segfault

More importantly,

==43569== Conditional jump or move depends on uninitialised value(s)
==43569== at 0x1473C515: pzgstrs (pzgstrs.c:1074)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)

You should run with valgrind's option --track-origins=yes to understand the reason for this.

On Sun, Nov 1, 2020 at 11:53, Barry Smith <[email protected]> wrote:

You can sometimes use -on_error_attach_debugger noxterm, and it will try to attach the debugger in the console where you started the job. If you are lucky this works, and you can use bt to see the stack and look at variables. But if multiple ranks crash the debugger will get confused, and even if only one crashes, if it is not rank zero the stty can get messed up so that you cannot type to control the debugger.

The valgrind information is very valuable; Sherry can likely look at those lines and have a really good idea of what the problem is. For example,

Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd

means that for some reason the code is writing past the end of an allocated array, either because the array allocated was not long enough or because the code has some issue where it wants to write further than it should. This kind of thing is very common and usually easy to debug by someone who knows the code once they know exactly which line of code is problematic, since valgrind shows exactly where the memory was allocated and exactly where the write went out of bounds.

Barry

On Nov 1, 2020, at 1:21 AM, Marius Buerkle <[email protected]> wrote:

<valgrind.tar.gz>

Hi,

I cannot use -on_error_attach_debugger because X forwarding does not work on the system. Is it possible to dump the gdb output to a file instead?

I ran it through valgrind. It seems there is some problem during calls in superlu_dist, but I don't know if this eventually causes the segfault. I think this is the relevant output:

==43569== Conditional jump or move depends on uninitialised value(s)
==43569== at 0x1473C515: pzgstrs (pzgstrs.c:1074)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)
==43569==
==43569== Use of uninitialised value of size 8
==43569== at 0x1473C554: pzgstrs (pzgstrs.c:1077)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)
==43569==
==43569== Use of uninitialised value of size 8
==43569== at 0x1473C55A: pzgstrs (pzgstrs.c:1077)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)
==43569==
==43569== Invalid write of size 8
==43569== at 0x1473C554: pzgstrs (pzgstrs.c:1077)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)
==43569== Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd
==43569== at 0x4C2D814: memalign (vg_replace_malloc.c:906)
==43569== by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)
==43569== by 0x1464D488: superlu_malloc_dist (memory.c:127)
==43569== by 0x1473C451: pzgstrs (pzgstrs.c:1044)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)
==43569==
==43569== Invalid write of size 8
==43569== at 0x1473C55A: pzgstrs (pzgstrs.c:1077)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)
==43569== Address 0x266e5ad0 is 16 bytes after a block of size 35,520 alloc'd
==43569== at 0x4C2D814: memalign (vg_replace_malloc.c:906)
==43569== by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)
==43569== by 0x1464D488: superlu_malloc_dist (memory.c:127)
==43569== by 0x1473C451: pzgstrs (pzgstrs.c:1044)
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569== by 0x40465D: main (superlu_test.c:59)
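
As a reading aid for the two "Invalid write" records above: the reported address and block size determine where the block starts and how far past its end each write landed. The following is a minimal sketch of that arithmetic in C. The helper names are hypothetical and are not part of valgrind, PETSc, or SuperLU_DIST.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: start address implied by a valgrind message of the
 * form "Address A is N bytes after a block of size S alloc'd". */
uint64_t block_base(uint64_t addr, uint64_t bytes_after, uint64_t size)
{
    assert(addr >= bytes_after + size); /* the block must fit below addr */
    return addr - bytes_after - size;
}

/* Hypothetical helper: distance of addr from the first byte past the block.
 * 0 means the access starts exactly one element past the end; a negative
 * value means the address lies inside the block. */
int64_t bytes_past_block(uint64_t base, uint64_t size, uint64_t addr)
{
    return (int64_t)addr - (int64_t)(base + size);
}
```

With the numbers from the log, block_base(0x266e5ac0, 0, 35520) gives 0x266dd000, and bytes_past_block(0x266dd000, 35520, 0x266e5ad0) gives 16, which matches the second record. Both writes therefore start at or just beyond the end of the 35,520-byte buffer allocated at pzgstrs.c:1044.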
I also attached the whole log. Does this make any sense? The problem seems to be around where I get the original segfault.

Best,
Marius

Sent: Saturday, October 31, 2020 at 04:07
From: "Barry Smith" <[email protected]>
To: "Marius Buerkle" <[email protected]>
Cc: "Xiaoye S. Li" <[email protected]>, "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
Subject: Re: [petsc-users] superlu_dist segfault

Have you run it with valgrind yet? There could be memory corruption earlier that causes a later crash; crashes that occur at different places for the same run are almost always due to memory corruption.

If valgrind is clean, you can run with -on_error_attach_debugger, and if X forwarding is set up it will open a debugger on the crashing process; you can then type bt to see exactly where it is crashing, at what line number and code line.

Barry

On Oct 29, 2020, at 1:04 AM, Marius Buerkle <[email protected]> wrote:

Hi Sherry,

I used only 1 OpenMP thread, and I also recompiled PETSc in debug mode with OpenMP turned off, but it did not help.

Here is the output I can get from SuperLU during the PETSc run:

Nonzeros in L 29519630
Nonzeros in U 29519630
nonzeros in L+U 58996711
nonzeros in LSUB 4509612
** Memory Usage **********************************
** NUMfact space (MB): (sum-of-all-processes)
L\U : 952.18 | Total : 1980.60
** Total highmark (MB):
Sum-of-all : 12401.85 | Avg : 387.56 | Max : 387.56
**************************************************
**************************************************
**** Time (seconds) ****
EQUIL time 0.06
ROWPERM time 1.03
COLPERM time 1.01
SYMBFACT time 0.45
DISTRIBUTE time 0.33
FACTOR time 0.90
Factor flops 2.225916e+11 Mflops 247438.62
SOLVE time 0.000
**************************************************

I tried all available ordering options for Colperm (NATURAL, MMD_AT_PLUS_A, MMD_ATA, METIS_AT_PLUS_A), save for parmetis, which always crashes. For Rowperm I used NOROWPERM and LargeDiag_MC64. All give the same segfault.

Sent: Thursday, October 29, 2020 at 14:14
From: "Xiaoye S. Li" <[email protected]>
To: "Marius Buerkle" <[email protected]>
Cc: "Zhang, Hong" <[email protected]>, "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
Subject: Re: Re: Re: [petsc-users] superlu_dist segfault

Hong: thanks for the diagnosis!

Marius: how many OpenMP threads are you using per MPI task?

In an earlier email, you mentioned the allocation failure at the following line:

if ( !(lsum = (doublecomplex*) SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");

This is in the solve phase. I think when we did some OpenMP optimization, we allowed several data structures to grow with the number of OpenMP threads. You can try using 1 thread.
The RHS and X memories are easy to compute. However, in order to gauge how much memory is used in the factorization, can you print out the number of nonzeros in the L and U factors? What ordering option are you using? The sparse matrix A looks pretty small.

The code can also print out the working storage used during factorization. I am not sure how this printing can be turned on through PETSc.

Sherry

On Wed, Oct 28, 2020 at 9:43 PM Marius Buerkle <[email protected]> wrote:

Thanks for the swift reply.
I also realized that if I reduce the number of RHS then it works. But I am running the code on a cluster with 256 GB RAM per node. One dense matrix would be around 30 GB, so 60 GB in total, which is large but does not exceed the memory of even one node, and I also get the segfault if I run it on several nodes. Moreover, it works well with the MUMPS and MKL_CPARDISO solvers. The maximum memory used with MUMPS is around 150 GB during the solver phase, but with SuperLU_DIST it crashes even before reaching the solver phase. Could there be such a large difference in memory usage between SuperLU_DIST and MUMPS?
best,
marius
Sent: Thursday, October 29, 2020 at 10:10
From: "Zhang, Hong" <[email protected]>
To: "Marius Buerkle" <[email protected]>
Cc: "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
Subject: Re: Re: [petsc-users] superlu_dist segfault

Marius,

I tested your code with petsc-release on my Mac laptop using np=2 cores. I first tested a small matrix data file successfully. Then I switched to your data file and ran out of memory, likely due to the dense matrices B and X. I got the error "Your system has run out of application memory" from my laptop.

The sparse matrix A has size 42549 by 42549. Your code creates dense matrices B and X of the same size, which is a huge memory requirement!

After replacing B and X with matrices of size 42549 by nrhs (nrhs <= 4000), the code ran well with np=2. Note the error message you got:

[23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range

The modified code I used is attached.

Hong
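
To put numbers on the memory requirement Hong describes, here is a minimal sketch in C of the storage needed by a dense matrix of complex doubles. It assumes 16 bytes per entry (two 8-byte doubles), which is an assumption consistent with the complex-valued pzgstrs/pzgssvx routines in the traces. The function names are hypothetical, and PETSc's own bookkeeping overhead is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one complex double entry occupies 16 bytes. */
#define BYTES_PER_COMPLEX_DOUBLE 16u

/* Hypothetical helper: bytes needed for a dense rows x cols complex matrix. */
uint64_t dense_complex_bytes(uint64_t rows, uint64_t cols)
{
    /* Guard against overflow of the 64-bit product. */
    assert(cols == 0 || rows <= UINT64_MAX / BYTES_PER_COMPLEX_DOUBLE / cols);
    return rows * cols * BYTES_PER_COMPLEX_DOUBLE;
}

/* Hypothetical helper: convert bytes to gigabytes (10^9 bytes). */
double bytes_to_gb(uint64_t bytes)
{
    return (double)bytes / 1e9;
}
```

Under these assumptions, dense_complex_bytes(42549, 42549) is 28,966,678,416 bytes, or about 29 GB per matrix, so B and X together need roughly 58 GB, in line with the estimate of about 60 GB given elsewhere in the thread. With nrhs = 4000 columns the figure drops to 2,723,136,000 bytes, about 2.7 GB per matrix.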
From: Marius Buerkle <[email protected]>
Sent: Tuesday, October 27, 2020 10:01 PM
To: Zhang, Hong <[email protected]>
Cc: [email protected] <[email protected]>; Sherry Li <[email protected]>
Subject: Re: Re: [petsc-users] superlu_dist segfault

Hi,

I recompiled PETSc with the debug option; now I get a segfault at a different position:

[23]PETSC ERROR: ------------------------------------------------------------------------
[23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[23]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[23]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[23]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[23]PETSC ERROR: likely location of problem given in stack below
[23]PETSC ERROR: --------------------- Stack Frames ------------------------------------
[23]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[23]PETSC ERROR: INSTEAD the line number of the start of the function
[23]PETSC ERROR: is given.
[23]PETSC ERROR: [23] SuperLU_DIST:pzgssvx line 242 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[23]PETSC ERROR: [23] MatMatSolve_SuperLU_DIST line 211 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[23]PETSC ERROR: [23] MatMatSolve line 3466 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/interface/matrix.c
[23]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[23]PETSC ERROR: Signal received

I made a small reproducer. The matrix is a bit too big, so I cannot attach it directly to the email, but I put it in the cloud.

Best,
Marius

Sent: Tuesday, October 27, 2020 at 23:11
From: "Zhang, Hong" <[email protected]>
To: "Marius Buerkle" <[email protected]>, "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
Subject: Re: [petsc-users] superlu_dist segfault

Marius,

It fails at line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c:

if ( !(lsum = (doublecomplex*)SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");

We do not know what it means. You may use a debugger to check the values of the variables involved.

I'm cc'ing Sherry (superlu_dist developer), or you may send us a short stand-alone code that reproduces the error. We can help with its investigation.

Hong
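
One way to inspect the variables in the quoted allocation is to recompute the requested byte count separately and check whether it is plausible. The sketch below is a stand-alone illustration in C, not SuperLU_DIST code: the struct is a stand-in that assumes doublecomplex is a pair of doubles, the function name is hypothetical, and the actual values of sizelsum and num_thread in the failing run are not given in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for SuperLU_DIST's doublecomplex (assumed: two doubles). */
typedef struct { double r, i; } doublecomplex_sketch;

/* Hypothetical helper: computes sizelsum * num_thread * sizeof(element)
 * in size_t. Sets *ok to 1 on success; returns 0 and sets *ok to 0 if
 * the product would not fit in size_t. */
size_t lsum_bytes_checked(size_t sizelsum, size_t num_thread, int *ok)
{
    const size_t elem = sizeof(doublecomplex_sketch);
    assert(ok != NULL);
    *ok = 0;
    if (num_thread != 0 && sizelsum > SIZE_MAX / num_thread) return 0;
    size_t count = sizelsum * num_thread;
    if (count > SIZE_MAX / elem) return 0;
    *ok = 1;
    return count * elem;
}
```

Printing the result for the values seen in a debugger shows directly how the request scales with num_thread and whether it is a size the node could reasonably satisfy. If the inputs were themselves computed in a narrower integer type, a wrapped value would also show up here as an implausibly large or small request, but whether that happens in this case is not established by the thread.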
From: petsc-users <[email protected]> on behalf of Marius Buerkle <[email protected]>
Sent: Tuesday, October 27, 2020 8:46 AM
To: [email protected] <[email protected]>
Subject: [petsc-users] superlu_dist segfault

Hi,

When using MatMatSolve with superlu_dist I get a segmentation fault:

Malloc fails for lsum[]. at line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c

The matrix size is not particularly big. I am using the PETSc release branch, and superlu_dist is v6.3.0, I think.

Best,
Marius

--
Stefano
