Hmm, here is the macro
#define PetscCallAbort(comm, ...) \
do { \
PetscErrorCode ierr_petsc_call_abort_; \
PetscStackUpdateLine; \
ierr_petsc_call_abort_ = __VA_ARGS__; \
if (PetscUnlikely(ierr_petsc_call_abort_ != PETSC_SUCCESS)) { \
ierr_petsc_call_abort_ = PetscError(PETSC_COMM_SELF, __LINE__, PETSC_FUNCTION_NAME, __FILE__, ierr_petsc_call_abort_, PETSC_ERROR_REPEAT, " "); \
(void)MPI_Abort(comm, (PetscMPIInt)ierr_petsc_call_abort_); \
} \
} while (0)
It does not seem to increment anything in the stack, so I think calling it in
a loop should be OK.
Perhaps your function has a PetscFunctionBegin but no PetscFunctionReturn(), or
in some other way increases the stack size without decreasing it?
> On Feb 24, 2023, at 12:39 PM, Sajid Ali Syed <[email protected]> wrote:
>
> Hi Barry,
>
> The application calls PetscCallAbort in a loop, i.e.
>
> for i in range:
> void routine(PetscCallAbort(function_returning_petsc_error_code))
>
> From the prior logs it looks like the stack grows every time PetscCallAbort
> is called (in other words, the stack does not shrink upon successful exit
> from PetscCallAbort).
>
> Is this usage pattern not recommended? Should I be manually checking for
> success of the `function_returning_petsc_error_code` and throw instead of
> relying on PetscCallAbort?
>
>
>
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io <http://s-sajid-ali.github.io/>
> From: Barry Smith <[email protected] <mailto:[email protected]>>
> Sent: Wednesday, February 22, 2023 6:49 PM
> To: Sajid Ali Syed <[email protected] <mailto:[email protected]>>
> Cc: Matthew Knepley <[email protected] <mailto:[email protected]>>;
> [email protected] <mailto:[email protected]>
> <[email protected] <mailto:[email protected]>>
> Subject: Re: [petsc-users] KSP_Solve crashes in debug mode
>
>
> Hmm, there could be a bug in our handling of the stack when it reaches the
> maximum size. It is supposed to just stop collecting additional levels at
> that point, but likely this has not been tested since a lot of refactoring.
>
> What are you doing to have so many stack frames?
>
>> On Feb 22, 2023, at 6:32 PM, Sajid Ali Syed <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>> Hi Matt,
>>
>> Adding `-checkstack` does not prevent the crash, both on my laptop and on
>> the cluster.
>>
>> What does prevent the crash (on my laptop at least) is changing
>> `PETSCSTACKSIZE` from 64 to 256 here :
>> https://github.com/petsc/petsc/blob/main/include/petscerror.h#L1153
>>
>>
>> Thank You,
>> Sajid Ali (he/him) | Research Associate
>> Data Science, Simulation, and Learning Division
>> Fermi National Accelerator Laboratory
>> s-sajid-ali.github.io
>> From: Matthew Knepley <[email protected] <mailto:[email protected]>>
>> Sent: Wednesday, February 22, 2023 5:23 PM
>> To: Sajid Ali Syed <[email protected] <mailto:[email protected]>>
>> Cc: Barry Smith <[email protected] <mailto:[email protected]>>;
>> [email protected] <mailto:[email protected]>
>> <[email protected] <mailto:[email protected]>>
>> Subject: Re: [petsc-users] KSP_Solve crashes in debug mode
>>
>> On Wed, Feb 22, 2023 at 6:18 PM Sajid Ali Syed via petsc-users
>> <[email protected] <mailto:[email protected]>> wrote:
>> One thing to note in relation to the trace attached in the previous email is
>> that there are no warnings until the 36th call to KSP_Solve. The first error
>> (as indicated by ASAN) occurs somewhere before the 40th call to KSP_Solve
>> (part of what the application marks as turn 10 of the propagator). The crash
>> finally occurs on the 43rd call to KSP_Solve.
>>
>> Looking at the trace, it appears that stack handling is messed up and
>> eventually it causes the crash. This can happen when
>> PetscFunctionBegin is not matched up with PetscFunctionReturn. Can you try
>> running this with
>>
>> -checkstack
>>
>> Thanks,
>>
>> Matt
>>
>> Thank You,
>> Sajid Ali (he/him) | Research Associate
>> Data Science, Simulation, and Learning Division
>> Fermi National Accelerator Laboratory
>> s-sajid-ali.github.io
>> From: Sajid Ali Syed <[email protected] <mailto:[email protected]>>
>> Sent: Wednesday, February 22, 2023 5:11 PM
>> To: Barry Smith <[email protected] <mailto:[email protected]>>
>> Cc: [email protected] <mailto:[email protected]>
>> <[email protected] <mailto:[email protected]>>
>> Subject: Re: [petsc-users] KSP_Solve crashes in debug mode
>>
>> Hi Barry,
>>
>> Thanks a lot for fixing this issue. I ran the same problem on a linux
>> machine and have the following trace for the same crash (with ASAN turned on
>> for both PETSc (on the latest commit of the branch) and the application) :
>> https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940
>>
>> The trace seems to indicate a couple of buffer overflows, one of which
>> causes the crash. I'm not sure as to what causes them.
>>
>> Thank You,
>> Sajid Ali (he/him) | Research Associate
>> Data Science, Simulation, and Learning Division
>> Fermi National Accelerator Laboratory
>> s-sajid-ali.github.io
>> From: Barry Smith <[email protected] <mailto:[email protected]>>
>> Sent: Wednesday, February 15, 2023 2:01 PM
>> To: Sajid Ali Syed <[email protected] <mailto:[email protected]>>
>> Cc: [email protected] <mailto:[email protected]>
>> <[email protected] <mailto:[email protected]>>
>> Subject: Re: [petsc-users] KSP_Solve crashes in debug mode
>>
>>
>> https://gitlab.com/petsc/petsc/-/merge_requests/6075
>> should fix the possible recursive error condition Matt pointed out
>>
>>
>>> On Feb 9, 2023, at 6:24 PM, Matthew Knepley <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> On Thu, Feb 9, 2023 at 6:05 PM Sajid Ali Syed via petsc-users
>>> <[email protected] <mailto:[email protected]>> wrote:
>>> I added “-malloc_debug” in a .petscrc file and ran it again. The backtrace
>>> from lldb is in the attached file. The crash now seems to be at:
>>>
>>> Process 32660 stopped
>>> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f603fb8)
>>>     frame #0: 0x0000000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, fd=0x0000000000000000, format=0x0000000000000000) at mprint.c:601
>>>    598 `PetscViewerASCIISynchronizedPrintf()`, `PetscSynchronizedFlush()`
>>>    599 @*/
>>>    600 PetscErrorCode PetscFPrintf(MPI_Comm comm, FILE *fd, const char format[], ...)
>>> -> 601 {
>>>    602 PetscMPIInt rank;
>>>    603
>>>    604 PetscFunctionBegin;
>>> (lldb) frame info
>>> frame #0: 0x0000000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, fd=0x0000000000000000, format=0x0000000000000000) at mprint.c:601
>>> (lldb)
>>> The trace seems to indicate some sort of infinite loop causing an overflow.
>>>
>>>
>>> Yes, I have also seen this. What happens is that we have a memory error.
>>> The error is reported inside PetscMallocValidate()
>>> using PetscErrorPrintf, which eventually calls PetscCallMPI, which calls
>>> PetscMallocValidate again, which fails. We need to
>>> remove all error checking from the prints inside Validate.
>>>
>>> Thanks,
>>>
>>> Matt
>>>
>>> PS: I'm using an arm64 Mac, so I don't have access to valgrind.
>>>
>>> Thank You,
>>> Sajid Ali (he/him) | Research Associate
>>> Scientific Computing Division
>>> Fermi National Accelerator Laboratory
>>> s-sajid-ali.github.io
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their experiments
>> is infinitely more interesting than any results to which their experiments
>> lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/