Re: [petsc-users] KSP_Solve crashes in debug mode

Sajid Ali Syed via petsc-users Fri, 24 Feb 2023 09:39:23 -0800

Hi Barry,

The application calls PetscCallAbort in a loop, i.e.


for i in range:
  void routine(PetscCallAbort(function_returning_petsc_error_code))

From the prior logs it looks like the stack grows every time PetscCallAbort is 
called (in other words, the stack does not shrink upon successful exit from 
PetscCallAbort).

Is this usage pattern not recommended? Should I be manually checking for 
success of the `function_returning_petsc_error_code` and throwing instead of 
relying on PetscCallAbort?
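
To make the symptom concrete, here is a minimal toy model in C of a debug call 
stack whose depth only returns to zero when every push is matched by a pop. It 
is a sketch under stated assumptions, not PETSc's implementation: `ToyStack`, 
`toy_push`, `toy_pop`, `balanced_routine`, `unbalanced_routine` and 
`depth_after_loop` are hypothetical names, and the sketch makes no claim about 
what PetscCallAbort does internally.

```c
#include <assert.h>

/* Toy debug stack that only counts frames (hypothetical, not PETSc's structure). */
typedef struct {
  int depth;
} ToyStack;

static void toy_push(ToyStack *s) { s->depth++; }

static void toy_pop(ToyStack *s)
{
  assert(s->depth > 0); /* popping an empty stack would be a logic error */
  s->depth--;
}

/* Balanced routine: every exit path pops the frame it pushed. */
static int balanced_routine(ToyStack *s, int fail)
{
  toy_push(s);
  if (fail) {
    toy_pop(s);
    return 1;
  }
  toy_pop(s);
  return 0;
}

/* Unbalanced routine: the return skips the pop, so one frame leaks per call. */
static int unbalanced_routine(ToyStack *s)
{
  toy_push(s);
  return 0; /* missing toy_pop(s) */
}

/* Call the chosen routine n times and report the depth left behind. */
static int depth_after_loop(int n, int use_unbalanced)
{
  ToyStack s = {0};
  for (int i = 0; i < n; i++) {
    if (use_unbalanced) (void)unbalanced_routine(&s);
    else (void)balanced_routine(&s, 0);
  }
  return s.depth;
}
```

In this toy model, `depth_after_loop(100, 0)` leaves a depth of 0, while 
`depth_after_loop(100, 1)` leaves a depth of 100, i.e. one leaked frame per 
loop iteration.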



Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

________________________________
From: Barry Smith <bsm...@petsc.dev>
Sent: Wednesday, February 22, 2023 6:49 PM
To: Sajid Ali Syed <sas...@fnal.gov>
Cc: Matthew Knepley <knep...@gmail.com>; petsc-users@mcs.anl.gov 
<petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode


  Hmm, there could be a bug in our handling of the stack when it reaches the 
maximum. It is supposed to just stop collecting additional levels at that point 
but likely it has not been tested since a lot of refactoring.
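
The intended behavior (stop recording once the maximum is reached) can be 
sketched as a saturating push on a fixed-size array. This is only a toy 
illustration in C, not the PETSc source: `ToyStack`, `toy_push`, `toy_pop`, 
`toy_recorded` and `TOY_STACK_MAX` are hypothetical names, and tracking a 
logical depth beyond the capacity is an assumption of the sketch.

```c
#include <assert.h>

#define TOY_STACK_MAX 64 /* hypothetical capacity for this sketch */

typedef struct {
  const char *function[TOY_STACK_MAX]; /* recorded frame names */
  int         depth;                   /* logical depth, may exceed TOY_STACK_MAX */
} ToyStack;

/* Store the frame only while there is room, but always count it, so that
   later pops stay matched with pushes and nothing is written out of bounds. */
static void toy_push(ToyStack *s, const char *name)
{
  if (s->depth < TOY_STACK_MAX) s->function[s->depth] = name;
  s->depth++;
}

/* Undo one push. Frames beyond the capacity were never stored,
   so only the counter changes. */
static void toy_pop(ToyStack *s)
{
  assert(s->depth > 0);
  s->depth--;
}

/* Number of frames actually stored, never more than TOY_STACK_MAX. */
static int toy_recorded(const ToyStack *s)
{
  return s->depth < TOY_STACK_MAX ? s->depth : TOY_STACK_MAX;
}
```

With this design, pushing 100 frames leaves a logical depth of 100 but only 64 
recorded entries, and 100 matching pops bring the depth back to 0.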

   What are you doing to have so many stack frames?

On Feb 22, 2023, at 6:32 PM, Sajid Ali Syed <sas...@fnal.gov> wrote:

Hi Matt,

Adding `-checkstack` does not prevent the crash, either on my laptop or on the 
cluster.

What does prevent the crash (on my laptop at least) is changing 
`PETSCSTACKSIZE` from 64 to 256 here : 
https://github.com/petsc/petsc/blob/main/include/petscerror.h#L1153


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

________________________________
From: Matthew Knepley <knep...@gmail.com>
Sent: Wednesday, February 22, 2023 5:23 PM
To: Sajid Ali Syed <sas...@fnal.gov>
Cc: Barry Smith <bsm...@petsc.dev>; petsc-users@mcs.anl.gov 
<petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

On Wed, Feb 22, 2023 at 6:18 PM Sajid Ali Syed via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
One thing to note in relation to the trace attached in the previous email is 
that there are no warnings until the 36th call to KSP_Solve. The first error 
(as indicated by ASAN) occurs somewhere before the 40th call to KSP_Solve (part 
of what the application marks as turn 10 of the propagator). The crash finally 
occurs on the 43rd call to KSP_Solve.

Looking at the trace, it appears that stack handling is messed up and 
eventually it causes the crash. This can happen when
PetscFunctionBegin is not matched up with PetscFunctionReturn. Can you try 
running this with

  -checkstack

  Thanks,

     Matt

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

________________________________
From: Sajid Ali Syed <sas...@fnal.gov>
Sent: Wednesday, February 22, 2023 5:11 PM
To: Barry Smith <bsm...@petsc.dev>
Cc: petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

Hi Barry,

Thanks a lot for fixing this issue. I ran the same problem on a linux machine 
and have the following trace for the same crash (with ASAN turned on for both 
PETSc (on the latest commit of the branch) and the application) : 
https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940

The trace seems to indicate a couple of buffer overflows, one of which causes 
the crash. I'm not sure as to what causes them.

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

________________________________
From: Barry Smith <bsm...@petsc.dev>
Sent: Wednesday, February 15, 2023 2:01 PM
To: Sajid Ali Syed <sas...@fnal.gov>
Cc: petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode


https://gitlab.com/petsc/petsc/-/merge_requests/6075
should fix the possible recursive error condition Matt pointed out.


On Feb 9, 2023, at 6:24 PM, Matthew Knepley <knep...@gmail.com> wrote:

On Thu, Feb 9, 2023 at 6:05 PM Sajid Ali Syed via petsc-users 
<petsc-users@mcs.anl.gov> wrote:

I added `-malloc_debug` in a .petscrc file and ran it again. The backtrace from 
lldb is in the attached file. The crash now seems to be at:

Process 32660 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f603fb8)
    frame #0: 0x0000000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, fd=0x0000000000000000, format=0x0000000000000000) at mprint.c:601
   598               `PetscViewerASCIISynchronizedPrintf()`, `PetscSynchronizedFlush()`
   599      @*/
   600      PetscErrorCode PetscFPrintf(MPI_Comm comm, FILE *fd, const char format[], ...)
-> 601      {
   602       PetscMPIInt rank;
   603      
   604       PetscFunctionBegin;
(lldb) frame info
frame #0: 0x0000000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, fd=0x0000000000000000, format=0x0000000000000000) at mprint.c:601
(lldb)


The trace seems to indicate some sort of infinite loop causing an overflow.

Yes, I have also seen this. What happens is that we have a memory error. The 
error is reported inside PetscMallocValidate()
using PetscErrorPrintf, which eventually calls PetscCallMPI, which calls 
PetscMallocValidate again, which fails. We need to
remove all error checking from the prints inside Validate.
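
The recursion and the proposed remedy can be sketched with a toy model in C. 
Everything here is hypothetical (`toy_validate`, `toy_checked_print`, 
`toy_unchecked_print`, `heap_corrupted`, `validate_calls`); it only mirrors the 
shape of the problem, in which a validation routine reports its failure through 
a print path that itself triggers validation. It is not the PETSc code or the 
actual patch.

```c
#include <stdio.h>

static int heap_corrupted = 0; /* stands in for a detected memory error */
static int validate_calls = 0; /* counts how often validation runs */

static int toy_validate(void);

/* Checked print: validates first, like a print routed through error checking. */
static int toy_checked_print(const char *msg)
{
  if (toy_validate()) return 1;
  fputs(msg, stderr);
  return 0;
}

/* Unchecked print: writes the message without any validation. */
static void toy_unchecked_print(const char *msg)
{
  fputs(msg, stderr);
}

/* Validation reports failures through the unchecked path only. Calling
   toy_checked_print() here instead would re-enter toy_validate(), fail
   again, and recurse until the call stack overflows. */
static int toy_validate(void)
{
  validate_calls++;
  if (heap_corrupted) {
    toy_unchecked_print("toy_validate: memory corruption detected\n");
    return 1;
  }
  return 0;
}
```

With `heap_corrupted` set, a single `toy_checked_print()` call runs the 
validation once, reports the error, and returns a nonzero code instead of 
recursing.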

  Thanks,

     Matt


PS: I'm using an arm64 Mac, so I don't have access to valgrind.

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/



