Code?
> On Nov 2, 2020, at 9:27 AM, Marius Buerkle <[email protected]> wrote:
>
>
> The matrix is a bit too big for an email attachment, so I put it on OneDrive:
> https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw
>
>
> Sent: Monday, 2 November 2020 at 23:58
> From: "Barry Smith" <[email protected]>
> To: "Marius Buerkle" <[email protected]>
> Cc: "Stefano Zampini" <[email protected]>, "[email protected]"
> <[email protected]>, "Sherry Li" <[email protected]>
> Subject: Re: [petsc-users] superlu_dist segfault
>
> Please send this program and your data file. This should definitely not be
> happening.
>
> Barry
>
> Valgrind is generally trustworthy.
>
> On Nov 2, 2020, at 12:21 AM, Marius Buerkle <[email protected]> wrote:
>
> Hi,
>
> I tried valgrind with --track-origins=yes; valgrind itself crashes at some
> point, apparently from running out of memory. But before that I get a lot of
> "Conditional jump or move depends on uninitialised value(s)" and "Use of
> uninitialised value of size 8" warnings, not all of them related to PETSc, but
> some of them are during MatLoad, PCSetup_LU, and also in SuperLU. For example:
>
> ==41867== Conditional jump or move depends on uninitialised value(s)
> ==41867== at 0x5DEA7C4: MatSetValues_MPIAIJ (mpiaij.c:601)
> ==41867== by 0x5E310D8: MatMPIAIJSetPreallocationCSR_MPIAIJ (mpiaij.c:4031)
> ==41867== by 0x5E31773: MatMPIAIJSetPreallocationCSR (mpiaij.c:4091)
> ==41867== by 0x5E2184C: MatLoad_MPIAIJ_Binary (mpiaij.c:3197)
> ==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)
> ==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)
> ==41867== by 0x4063ED: main (superlu_test.c:28)
> ==41867== Uninitialised value was created by a heap allocation
> ==41867== at 0x4C2D814: memalign (vg_replace_malloc.c:906)
> ==41867== by 0x50220D6: PetscMallocAlign (mal.c:52)
> ==41867== by 0x50242D4: PetscMallocA (mal.c:425)
> ==41867== by 0x5E20FC2: MatLoad_MPIAIJ_Binary (mpiaij.c:3187)
> ==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)
> ==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)
> ==41867== by 0x4063ED: main (superlu_test.c:28)
> ==41867==
> ==41867== Use of uninitialised value of size 8
> ==41867== at 0x5DEA8AE: MatSetValues_MPIAIJ (mpiaij.c:603)
> ==41867== by 0x5E310D8: MatMPIAIJSetPreallocationCSR_MPIAIJ (mpiaij.c:4031)
> ==41867== by 0x5E31773: MatMPIAIJSetPreallocationCSR (mpiaij.c:4091)
> ==41867== by 0x5E2184C: MatLoad_MPIAIJ_Binary (mpiaij.c:3197)
> ==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)
> ==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)
> ==41867== by 0x4063ED: main (superlu_test.c:28)
> ==41867== Uninitialised value was created by a heap allocation
> ==41867== at 0x4C2D814: memalign (vg_replace_malloc.c:906)
> ==41867== by 0x50220D6: PetscMallocAlign (mal.c:52)
> ==41867== by 0x50242D4: PetscMallocA (mal.c:425)
> ==41867== by 0x5E20FC2: MatLoad_MPIAIJ_Binary (mpiaij.c:3187)
> ==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)
> ==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)
> ==41867== by 0x4063ED: main (superlu_test.c:28)
>
> I don't know whether these are real errors or just valgrind noise. I attached
> the whole valgrind logs; they are rather noisy though.
>
> Best,
> Marius
>
>
> Sent: Sunday, 1 November 2020 at 19:09
> From: "Stefano Zampini" <[email protected]>
> To: "Barry Smith" <[email protected]>
> Cc: "Marius Buerkle" <[email protected]>,
> "[email protected]" <[email protected]>, "Sherry Li"
> <[email protected]>
> Subject: Re: [petsc-users] superlu_dist segfault
> More importantly,
>
> ==43569== Conditional jump or move depends on uninitialised value(s)
> ==43569== at 0x1473C515: pzgstrs (pzgstrs.c:1074)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
>
> You should run using valgrind's option --track-origins=yes to understand the
> reason for this.
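> For example, something along these lines (the launcher and executable names
> are just placeholders for whatever you normally use):
>
>   mpiexec -n 4 valgrind --track-origins=yes --log-file=valgrind.%p.log ./superlu_test
>
> which writes one log file per process.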
>
> On Sun, Nov 1, 2020 at 11:53 AM Barry Smith <[email protected]> wrote:
>
>
> You can sometimes use -on_error_attach_debugger noxterm, and it will try to
> attach the debugger directly in the console where you started the job. If you
> are lucky this works, and you can use bt to see the stack and look at
> variables. But if multiple ranks crash, the debugger will get confused; and
> even if only one rank crashes, if it is not rank zero the stty settings can get
> messed up so that you cannot type to control the debugger.
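> Roughly (the launcher and executable names here are only placeholders):
>
>   mpiexec -n 4 ./superlu_test -on_error_attach_debugger noxterm
>
> and then bt at the (gdb) prompt of the crashing rank.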
>
> The valgrind information is very valuable; Sherry can likely look at those
> lines and have a really good idea what the problem is. For example,
>
> Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd
> means that for some reason the code is writing past the end of an allocated
> array, either because the allocated array was not long enough or because the
> code has some issue that makes it write further than it should. This kind of
> thing is very common and usually easy to debug by someone who knows the code
> once they know exactly which line of code is problematic, since valgrind shows
> exactly where the memory was allocated and exactly where it went out of bounds.
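> In miniature (an illustration of the pattern only, not the actual SuperLU
> code), valgrind reports exactly this kind of thing for
>
>   #include <stdlib.h>
>   int main(void)
>   {
>     int     n    = 4440;                        /* 4440 doubles = 35,520 bytes   */
>     double *work = malloc(n * sizeof(double));  /* "block of size 35,520 alloc'd" */
>     work[n] = 0.0;                              /* one element past the end:
>                                                    "Invalid write ... 0 bytes after a block" */
>     free(work);
>     return 0;
>   }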
>
> Barry
>
>
> On Nov 1, 2020, at 1:21 AM, Marius Buerkle <[email protected]> wrote:
>
> Hi,
>
> I cannot use -on_error_attach_debugger as X forwarding does not work on the
> system. Is it possible to dump the gdb output to a file instead?
>
> I ran it through valgrind. It seems there is some problem during calls in
> superlu_dist, but I don't know whether this eventually causes the seg fault. I
> think this is the relevant output:
>
> ==43569== Conditional jump or move depends on uninitialised value(s)
> ==43569== at 0x1473C515: pzgstrs (pzgstrs.c:1074)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
> ==43569==
> ==43569== Use of uninitialised value of size 8
> ==43569== at 0x1473C554: pzgstrs (pzgstrs.c:1077)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
> ==43569==
> ==43569== Use of uninitialised value of size 8
> ==43569== at 0x1473C55A: pzgstrs (pzgstrs.c:1077)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
> ==43569==
> ==43569== Invalid write of size 8
> ==43569== at 0x1473C554: pzgstrs (pzgstrs.c:1077)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
> ==43569== Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd
> ==43569== at 0x4C2D814: memalign (vg_replace_malloc.c:906)
> ==43569== by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)
> ==43569== by 0x1464D488: superlu_malloc_dist (memory.c:127)
> ==43569== by 0x1473C451: pzgstrs (pzgstrs.c:1044)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
> ==43569==
> ==43569== Invalid write of size 8
> ==43569== at 0x1473C55A: pzgstrs (pzgstrs.c:1077)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
> ==43569== Address 0x266e5ad0 is 16 bytes after a block of size 35,520 alloc'd
> ==43569== at 0x4C2D814: memalign (vg_replace_malloc.c:906)
> ==43569== by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)
> ==43569== by 0x1464D488: superlu_malloc_dist (memory.c:127)
> ==43569== by 0x1473C451: pzgstrs (pzgstrs.c:1044)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
> ==43569==
>
> I also attached the whole log. Does this make any sense? The problem seems to
> be around where I get the original segfault.
>
> best,
> marius
>
>
> Sent: Saturday, 31 October 2020 at 04:07
> From: "Barry Smith" <[email protected]>
> To: "Marius Buerkle" <[email protected]>
> Cc: "Xiaoye S. Li" <[email protected]>,
> "[email protected]" <[email protected]>, "Sherry Li"
> <[email protected]>
> Subject: Re: [petsc-users] superlu_dist segfault
>
> Have you run it yet with valgrind? It could be memory corruption earlier that
> causes a later crash; crashes that occur at different places for the same run
> are almost always due to memory corruption.
>
> If valgrind is clean you can run with -on_error_attach_debugger, and if X
> forwarding is set up it will open a debugger on the crashing process; you can
> then type bt to see exactly where it is crashing and at which line of code.
>
> Barry
>
>
> On Oct 29, 2020, at 1:04 AM, Marius Buerkle <[email protected]> wrote:
>
> Hi Sherry,
>
> I used only 1 OpenMP thread, and I also recompiled PETSc in debug mode with
> OpenMP turned off, but it did not help.
>
> Here is the output I can get from SuperLU during the PETSc run:
> Nonzeros in L 29519630
> Nonzeros in U 29519630
> nonzeros in L+U 58996711
> nonzeros in LSUB 4509612
> ** Memory Usage **********************************
> ** NUMfact space (MB): (sum-of-all-processes)
> L\U : 952.18 | Total : 1980.60
> ** Total highmark (MB):
> Sum-of-all : 12401.85 | Avg : 387.56 | Max : 387.56
> **************************************************
> **************************************************
> **** Time (seconds) ****
> EQUIL time 0.06
> ROWPERM time 1.03
> COLPERM time 1.01
> SYMBFACT time 0.45
> DISTRIBUTE time 0.33
> FACTOR time 0.90
> Factor flops 2.225916e+11 Mflops 247438.62
> SOLVE time 0.000
> **************************************************
>
> I tried all available ordering options for ColPerm
> (NATURAL, MMD_AT_PLUS_A, MMD_ATA, METIS_AT_PLUS_A), except for parmetis, which
> always crashes. For RowPerm I used NOROWPERM and LargeDiag_MC64. All give the
> same seg fault.
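> (These were selected through PETSc's runtime options, i.e. something like
> -mat_superlu_dist_colperm METIS_AT_PLUS_A and -mat_superlu_dist_rowperm NOROWPERM;
> the option names here are from memory, so please check them against the
> MATSOLVERSUPERLU_DIST manual page.)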
>
>
> Sent: Thursday, 29 October 2020 at 14:14
> From: "Xiaoye S. Li" <[email protected]>
> To: "Marius Buerkle" <[email protected]>
> Cc: "Zhang, Hong" <[email protected]>,
> "[email protected]" <[email protected]>, "Sherry Li"
> <[email protected]>
> Subject: Re: Re: Re: [petsc-users] superlu_dist segfault
> Hong: thanks for the diagnosis!
>
> Marius: how many OpenMP threads are you using per MPI task?
> In an earlier email, you mentioned the allocation failure at the following
> line:
> if ( !(lsum = (doublecomplex*) SUPERLU_MALLOC(sizelsum*num_thread *
> sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");
>
> This is in the solve phase. I think when we did some OpenMP optimization, we
> allowed several data structures to grow with the number of OpenMP threads. You
> can try to use 1 thread.
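> For example, in the job script before launching (assuming the standard OpenMP
> environment variable is what controls the thread count in your build):
>
>   export OMP_NUM_THREADS=1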
>
> The memory for the RHS and X is easy to compute. However, in order to gauge
> how much memory is used in the factorization, can you print out the number of
> nonzeros in the L and U factors? Which ordering option are you using? The
> sparse matrix A looks pretty small.
>
> The code can also print out the working storage used during factorization. I
> am not sure how this printing can be turned on through PETSc.
>
> Sherry
>
> On Wed, Oct 28, 2020 at 9:43 PM Marius Buerkle <[email protected]> wrote:
> Thanks for the swift reply.
>
> I also realized that if I reduce the number of RHS then it works. But I am
> running the code on a cluster with 256 GB RAM per node. One dense matrix would
> be around ~30 GB, so 60 GB in total, which is large but does not exceed the
> memory of even one node, and I also get the seg fault if I run it on several
> nodes. Moreover, it works well with the MUMPS and MKL_CPARDISO solvers. The
> maximum memory used with MUMPS is around 150 GB during the solve phase, but
> SuperLU_DIST crashes even before reaching the solve phase. Could there be such
> a large difference in memory usage between SuperLU_DIST and MUMPS?
>
>
> best,
>
> marius
>
>
> Sent: Thursday, 29 October 2020 at 10:10
> From: "Zhang, Hong" <[email protected]>
> To: "Marius Buerkle" <[email protected]>
> Cc: "[email protected]" <[email protected]>, "Sherry Li"
> <[email protected]>
> Subject: Re: Re: [petsc-users] superlu_dist segfault
> Marius,
> I tested your code with petsc-release on my Mac laptop using np=2 cores. I
> first tested a small matrix data file successfully. Then I switched to your
> data file and ran out of memory, likely due to the dense matrices B and X. I
> got the error "Your system has run out of application memory" from my laptop.
>
> The sparse matrix A has size 42549 by 42549. Your code creates dense matrices
> B and X with the same size -- a huge memory requirement!
> By replacing B and X with size 42549 by nrhs (nrhs <= 4000), I had the code
> run well with np=2. Note the error message you got:
> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
>
> The modified code I used is attached.
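> Schematically, it amounts to something like the sketch below (simplified, with
> error checking omitted; the file name, the nrhs value, and the use of the
> KSP/PCLU route here are illustrative, and the actual attached code may differ):
>
>   #include <petscksp.h>
>   int main(int argc, char **argv)
>   {
>     Mat         A, F, B, X;
>     KSP         ksp;
>     PC          pc;
>     PetscViewer viewer;
>     PetscInt    M, N, nrhs = 1000;               /* nrhs columns instead of N */
>
>     PetscInitialize(&argc, &argv, NULL, NULL);
>     PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.bin", FILE_MODE_READ, &viewer);
>     MatCreate(PETSC_COMM_WORLD, &A);
>     MatLoad(A, viewer);                          /* the 42549 x 42549 sparse matrix */
>     PetscViewerDestroy(&viewer);
>     MatGetSize(A, &M, &N);
>
>     /* dense RHS and solution with only nrhs columns, not N */
>     MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, M, nrhs, NULL, &B);
>     MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, M, nrhs, NULL, &X);
>     MatSetRandom(B, NULL);
>
>     /* LU factorization with SuperLU_DIST, then the multi-RHS solve */
>     KSPCreate(PETSC_COMM_WORLD, &ksp);
>     KSPSetOperators(ksp, A, A);
>     KSPGetPC(ksp, &pc);
>     PCSetType(pc, PCLU);
>     PCFactorSetMatSolverType(pc, MATSOLVERSUPERLU_DIST);
>     KSPSetFromOptions(ksp);
>     KSPSetUp(ksp);                               /* factorization (pzgssvx) happens here */
>     PCFactorGetMatrix(pc, &F);
>     MatMatSolve(F, B, X);                        /* triangular solves (pzgstrs via pzgssvx) */
>
>     MatDestroy(&A); MatDestroy(&B); MatDestroy(&X);
>     KSPDestroy(&ksp);
>     PetscFinalize();
>     return 0;
>   }
>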
> Hong
>
> From: Marius Buerkle <[email protected]>
> Sent: Tuesday, October 27, 2020 10:01 PM
> To: Zhang, Hong <[email protected]>
> Cc: [email protected] <[email protected]>; Sherry Li
> <[email protected]>
> Subject: Aw: Re: [petsc-users] superlu_dist segfault
>
> Hi,
>
> I recompiled PETSc with the debug option; now I get a seg fault at a different
> position:
>
> [23]PETSC ERROR:
> ------------------------------------------------------------------------
> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> [23]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [23]PETSC ERROR: or see
> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [23]PETSC ERROR: or try http://valgrind.org on
> GNU/linux and Apple Mac OS X to find memory corruption errors
> [23]PETSC ERROR: likely location of problem given in stack below
> [23]PETSC ERROR: --------------------- Stack Frames
> ------------------------------------
> [23]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [23]PETSC ERROR: INSTEAD the line number of the start of the function
> [23]PETSC ERROR: is given.
> [23]PETSC ERROR: [23] SuperLU_DIST:pzgssvx line 242
> /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [23]PETSC ERROR: [23] MatMatSolve_SuperLU_DIST line 211
> /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [23]PETSC ERROR: [23] MatMatSolve line 3466
> /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/interface/matrix.c
> [23]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [23]PETSC ERROR: Signal received
>
> I made a small reproducer. The matrix is a bit too big, so I cannot attach it
> directly to the email, but I put it in the cloud:
> https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw
>
> Best,
> Marius
>
>
> Sent: Tuesday, 27 October 2020 at 23:11
> From: "Zhang, Hong" <[email protected]>
> To: "Marius Buerkle" <[email protected]>,
> "[email protected]" <[email protected]>, "Sherry Li"
> <[email protected]>
> Subject: Re: [petsc-users] superlu_dist segfault
> Marius,
> It fails at the line 1075 in file
> /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c
> if ( !(lsum = (doublecomplex*)SUPERLU_MALLOC(sizelsum*num_thread *
> sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");
>
> We do not know what it means. You may use a debugger to check the values of
> the variables involved.
> I'm cc'ing Sherry (the superlu_dist developer); alternatively, you may send us
> a stand-alone short code that reproduces the error. We can help with its
> investigation.
> Hong
>
>
> From: petsc-users <[email protected]> on behalf of Marius Buerkle
> <[email protected]>
> Sent: Tuesday, October 27, 2020 8:46 AM
> To: [email protected] <[email protected]>
> Subject: [petsc-users] superlu_dist segfault
>
> Hi,
>
> When using MatMatSolve with superlu_dist I get a segmentation fault:
>
> Malloc fails for lsum[]. at line 1075 in file
> /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c
>
> The matrix size is not particularly big. I am using the PETSc release branch,
> and superlu_dist is v6.3.0, I think.
>
> Best,
> Marius
> <valgrind.tar.gz>
>
>
> --
> Stefano
> <valgrind_track-origins.tar.gz>