Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-24 Thread Sajid Ali Syed via petsc-users
Hi Barry,

The application calls PetscCallAbort in a loop, i.e.

for i in range(N):
    routine()  # a void routine that wraps each PETSc call as
               # PetscCallAbort(comm, function_returning_petsc_error_code(...))

From the prior logs it looks like the stack grows every time PetscCallAbort is 
called (in other words, the stack does not shrink upon successful exit from 
PetscCallAbort).

Is this usage pattern not recommended? Should I manually check the return code of 
`function_returning_petsc_error_code` and throw an exception instead of relying 
on PetscCallAbort?
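
For concreteness, a minimal sketch of the two patterns in question is below; the routine and argument names are assumptions, not the application's actual code:

    #include <petscksp.h>
    #include <stdexcept>

    // Current pattern: abort on the communicator if the PETSc call fails
    void routine(KSP ksp, Vec b, Vec x)
    {
      PetscCallAbort(PETSC_COMM_WORLD, KSPSolve(ksp, b, x));
    }

    // Alternative being asked about: check the return code manually and throw
    void routine_throwing(KSP ksp, Vec b, Vec x)
    {
      PetscErrorCode ierr = KSPSolve(ksp, b, x);
      if (ierr) throw std::runtime_error("KSPSolve returned an error");
    }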



Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Barry Smith 
Sent: Wednesday, February 22, 2023 6:49 PM
To: Sajid Ali Syed 
Cc: Matthew Knepley ; petsc-users@mcs.anl.gov 

Subject: Re: [petsc-users] KSP_Solve crashes in debug mode


  Hmm, there could be a bug in our handling of the stack when it reaches the 
maximum. It is supposed to just stop collecting additional levels at that point, 
but that path has likely not been tested since a number of refactorings.

   What are you doing to have so many stack frames?

On Feb 22, 2023, at 6:32 PM, Sajid Ali Syed  wrote:

Hi Matt,

Adding `-checkstack` does not prevent the crash, either on my laptop or on the 
cluster.

What does prevent the crash (on my laptop at least) is changing 
`PETSCSTACKSIZE` from 64 to 256 here: 
https://github.com/petsc/petsc/blob/main/include/petscerror.h#L1153
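
The workaround amounts to a one-line change of that macro; illustratively (256 is merely large enough to hide the overflow here, not a principled fix):

    /* include/petscerror.h (illustrative) */
    #define PETSCSTACKSIZE 256 /* default is 64 */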


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Matthew Knepley <knep...@gmail.com>
Sent: Wednesday, February 22, 2023 5:23 PM
To: Sajid Ali Syed <sas...@fnal.gov>
Cc: Barry Smith <bsm...@petsc.dev>; petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

On Wed, Feb 22, 2023 at 6:18 PM Sajid Ali Syed via petsc-users <petsc-users@mcs.anl.gov> wrote:
One thing to note in relation to the trace attached in the previous email is 
that there are no warnings until the 36th call to KSP_Solve. The first error 
(as indicated by ASAN) occurs somewhere before the 40th call to KSP_Solve (part 
of what the application marks as turn 10 of the propagator). The crash finally 
occurs on the 43rd call to KSP_Solve.

Looking at the trace, it appears that stack handling is messed up and 
eventually it causes the crash. This can happen when
PetscFunctionBegin is not matched up with PetscFunctionReturn. Can you try 
running this with

  -checkstack

  Thanks,

 Matt

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Sajid Ali Syed <sas...@fnal.gov>
Sent: Wednesday, February 22, 2023 5:11 PM
To: Barry Smith <bsm...@petsc.dev>
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

Hi Barry,

Thanks a lot for fixing this issue. I ran the same problem on a Linux machine 
and have the following trace for the same crash (with ASAN turned on for both 
PETSc, on the latest commit of the branch, and the application): 
https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940

The trace seems to indicate a couple of buffer overflows, one of which causes 
the crash. I'm not sure what causes them.

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-22 Thread Sajid Ali Syed via petsc-users
Via a checkpoint in `PetscOptionsCheckInitial_Private`, I can confirm that 
`checkstack` is set to `PETSC_TRUE`, yet this yields no (additional) 
information about erroneous stack handling.


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Sajid Ali Syed 
Sent: Wednesday, February 22, 2023 6:34 PM
To: Matthew Knepley 
Cc: Barry Smith ; petsc-users@mcs.anl.gov 

Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

Hi Matt,

This is a trace from the same crash, but with `-checkstack` included in 
.petscrc: https://gist.github.com/s-sajid-ali/455b3982d47a31bff9e7ee211dd43991
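
For reference, such an options file holds one option per line; a sketch (the exact contents of the file are an assumption):

    # .petscrc
    -checkstack
    -malloc_debug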


I don't see any additional information regarding the possible cause.


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Matthew Knepley 
Sent: Wednesday, February 22, 2023 6:28 PM
To: Sajid Ali Syed 
Cc: Barry Smith ; petsc-users@mcs.anl.gov 

Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

On Wed, Feb 22, 2023 at 6:32 PM Sajid Ali Syed <sas...@fnal.gov> wrote:
Hi Matt,

Adding `-checkstack` does not prevent the crash, either on my laptop or on the 
cluster.

It will not prevent a crash. The output is intended to show us where the stack 
problem originates. Can you send the output?

  Thanks,

Matt

What does prevent the crash (on my laptop at least) is changing 
`PETSCSTACKSIZE` from 64 to 256 here: 
https://github.com/petsc/petsc/blob/main/include/petscerror.h#L1153


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Matthew Knepley <knep...@gmail.com>
Sent: Wednesday, February 22, 2023 5:23 PM
To: Sajid Ali Syed <sas...@fnal.gov>
Cc: Barry Smith <bsm...@petsc.dev>; petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

On Wed, Feb 22, 2023 at 6:18 PM Sajid Ali Syed via petsc-users <petsc-users@mcs.anl.gov> wrote:
One thing to note in relation to the trace attached in the previous email is 
that there are no warnings until the 36th call to KSP_Solve. The first error 
(as indicated by ASAN) occurs somewhere before the 40th call to KSP_Solve (part 
of what the application marks as turn 10 of the propagator). The crash finally 
occurs on the 43rd call to KSP_Solve.

Looking at the trace, it appears that stack handling is messed up and 
eventually it causes the crash. This can happen when
PetscFunctionBegin is not matched up with PetscFunctionReturn. Can you try 
running this with

  -checkstack

  Thanks,

 Matt

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Sajid Ali Syed <sas...@fnal.gov>
Sent: Wednesday, February 22, 2023 5:11 PM
To: Barry Smith <bsm...@petsc.dev>
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

Hi Barry,

Thanks a lot for fixing this issue. I ran the same problem on a Linux machine 
and have the following trace for the same crash (with ASAN turned on for both 
PETSc, on the latest commit of the branch, and the application): 
https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940

The trace seems to indicate a couple of buffer overflows, one of which causes the crash.

Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-22 Thread Sajid Ali Syed via petsc-users
Hi Matt,

This is a trace from the same crash, but with `-checkstack` included in 
.petscrc: https://gist.github.com/s-sajid-ali/455b3982d47a31bff9e7ee211dd43991


I don't see any additional information regarding the possible cause.


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Matthew Knepley 
Sent: Wednesday, February 22, 2023 6:28 PM
To: Sajid Ali Syed 
Cc: Barry Smith ; petsc-users@mcs.anl.gov 

Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

On Wed, Feb 22, 2023 at 6:32 PM Sajid Ali Syed <sas...@fnal.gov> wrote:
Hi Matt,

Adding `-checkstack` does not prevent the crash, either on my laptop or on the 
cluster.

It will not prevent a crash. The output is intended to show us where the stack 
problem originates. Can you send the output?

  Thanks,

Matt

What does prevent the crash (on my laptop at least) is changing 
`PETSCSTACKSIZE` from 64 to 256 here: 
https://github.com/petsc/petsc/blob/main/include/petscerror.h#L1153


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Matthew Knepley <knep...@gmail.com>
Sent: Wednesday, February 22, 2023 5:23 PM
To: Sajid Ali Syed <sas...@fnal.gov>
Cc: Barry Smith <bsm...@petsc.dev>; petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

On Wed, Feb 22, 2023 at 6:18 PM Sajid Ali Syed via petsc-users <petsc-users@mcs.anl.gov> wrote:
One thing to note in relation to the trace attached in the previous email is 
that there are no warnings until the 36th call to KSP_Solve. The first error 
(as indicated by ASAN) occurs somewhere before the 40th call to KSP_Solve (part 
of what the application marks as turn 10 of the propagator). The crash finally 
occurs on the 43rd call to KSP_Solve.

Looking at the trace, it appears that stack handling is messed up and 
eventually it causes the crash. This can happen when
PetscFunctionBegin is not matched up with PetscFunctionReturn. Can you try 
running this with

  -checkstack

  Thanks,

 Matt

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Sajid Ali Syed <sas...@fnal.gov>
Sent: Wednesday, February 22, 2023 5:11 PM
To: Barry Smith <bsm...@petsc.dev>
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

Hi Barry,

Thanks a lot for fixing this issue. I ran the same problem on a Linux machine 
and have the following trace for the same crash (with ASAN turned on for both 
PETSc, on the latest commit of the branch, and the application): 
https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940

The trace seems to indicate a couple of buffer overflows, one of which causes 
the crash. I'm not sure what causes them.

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

From: Barry Smith <bsm...@petsc.dev>
Sent: Wednesday, February 15, 2023 2:01 PM
To: Sajid Ali Syed

Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-22 Thread Sajid Ali Syed via petsc-users
Hi Matt,

Adding `-checkstack` does not prevent the crash, either on my laptop or on the 
cluster.

What does prevent the crash (on my laptop at least) is changing 
`PETSCSTACKSIZE` from 64 to 256 here: 
https://github.com/petsc/petsc/blob/main/include/petscerror.h#L1153


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Matthew Knepley 
Sent: Wednesday, February 22, 2023 5:23 PM
To: Sajid Ali Syed 
Cc: Barry Smith ; petsc-users@mcs.anl.gov 

Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

On Wed, Feb 22, 2023 at 6:18 PM Sajid Ali Syed via petsc-users <petsc-users@mcs.anl.gov> wrote:
One thing to note in relation to the trace attached in the previous email is 
that there are no warnings until the 36th call to KSP_Solve. The first error 
(as indicated by ASAN) occurs somewhere before the 40th call to KSP_Solve (part 
of what the application marks as turn 10 of the propagator). The crash finally 
occurs on the 43rd call to KSP_Solve.

Looking at the trace, it appears that stack handling is messed up and 
eventually it causes the crash. This can happen when
PetscFunctionBegin is not matched up with PetscFunctionReturn. Can you try 
running this with

  -checkstack

  Thanks,

 Matt

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

From: Sajid Ali Syed <sas...@fnal.gov>
Sent: Wednesday, February 22, 2023 5:11 PM
To: Barry Smith <bsm...@petsc.dev>
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

Hi Barry,

Thanks a lot for fixing this issue. I ran the same problem on a Linux machine 
and have the following trace for the same crash (with ASAN turned on for both 
PETSc, on the latest commit of the branch, and the application): 
https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940

The trace seems to indicate a couple of buffer overflows, one of which causes 
the crash. I'm not sure what causes them.

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Barry Smith <bsm...@petsc.dev>
Sent: Wednesday, February 15, 2023 2:01 PM
To: Sajid Ali Syed <sas...@fnal.gov>
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode


https://gitlab.com/petsc/petsc/-/merge_requests/6075 should fix the possible recursive error condition Matt pointed out


On Feb 9, 2023, at 6:24 PM, Matthew Knepley <knep...@gmail.com> wrote:

On Thu, Feb 9, 2023 at 6:05 PM Sajid Ali Syed via petsc-users <petsc-users@mcs.anl.gov> wrote:

I added “-malloc_debug” in a .petscrc file and ran it again. The backtrace from 
lldb is in the attached file. The crash now seems to be at:

Process 32660 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f603fb8)
    frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, fd=0x, format=0x) at mprint.c:601
   598      `PetscViewerASCIISynchronizedPrintf()`, `PetscSynchronizedFlush()`
   599    @*/
   600    PetscErrorCode PetscFPrintf(MPI_Comm comm, FILE *fd, const char format[], ...)
-> 601    {
   602      PetscMPIInt rank;
   603
   604      PetscFunctionBegin;
(lldb) frame info
frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, fd=0x, format=0x) at mprint.c:601

Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-22 Thread Sajid Ali Syed via petsc-users
One thing to note in relation to the trace attached in the previous email is 
that there are no warnings until the 36th call to KSP_Solve. The first error 
(as indicated by ASAN) occurs somewhere before the 40th call to KSP_Solve (part 
of what the application marks as turn 10 of the propagator). The crash finally 
occurs on the 43rd call to KSP_Solve.


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Sajid Ali Syed 
Sent: Wednesday, February 22, 2023 5:11 PM
To: Barry Smith 
Cc: petsc-users@mcs.anl.gov 
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

Hi Barry,

Thanks a lot for fixing this issue. I ran the same problem on a Linux machine 
and have the following trace for the same crash (with ASAN turned on for both 
PETSc, on the latest commit of the branch, and the application): 
https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940

The trace seems to indicate a couple of buffer overflows, one of which causes 
the crash. I'm not sure what causes them.

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Barry Smith 
Sent: Wednesday, February 15, 2023 2:01 PM
To: Sajid Ali Syed 
Cc: petsc-users@mcs.anl.gov 
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode


https://gitlab.com/petsc/petsc/-/merge_requests/6075 should fix the possible recursive error condition Matt pointed out


On Feb 9, 2023, at 6:24 PM, Matthew Knepley  wrote:

On Thu, Feb 9, 2023 at 6:05 PM Sajid Ali Syed via petsc-users <petsc-users@mcs.anl.gov> wrote:

I added “-malloc_debug” in a .petscrc file and ran it again. The backtrace from 
lldb is in the attached file. The crash now seems to be at:

Process 32660 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f603fb8)
    frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, fd=0x, format=0x) at mprint.c:601
   598      `PetscViewerASCIISynchronizedPrintf()`, `PetscSynchronizedFlush()`
   599    @*/
   600    PetscErrorCode PetscFPrintf(MPI_Comm comm, FILE *fd, const char format[], ...)
-> 601    {
   602      PetscMPIInt rank;
   603
   604      PetscFunctionBegin;
(lldb) frame info
frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, fd=0x, format=0x) at mprint.c:601
(lldb)


The trace seems to indicate some sort of infinite loop causing an overflow.

Yes, I have also seen this. What happens is that we have a memory error. The 
error is reported inside PetscMallocValidate()
using PetscErrorPrintf, which eventually calls PetscCallMPI, which calls 
PetscMallocValidate again, which fails. We need to
remove all error checking from the prints inside Validate.
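
A minimal standalone sketch of that re-entrancy (assumed function bodies, not PETSc's actual source):

    #include <cstdio>

    void validate();

    // Error reporting that itself goes through checked calls ...
    void error_print()
    {
      validate(); // ... re-enters validation before printing
      std::fprintf(stderr, "memory corruption detected\n");
    }

    // Validation finds the (still present) corruption and reports it
    void validate()
    {
      bool corrupt = true;        // the memory error does not go away
      if (corrupt) error_print(); // so the two calls recurse without bound
    }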

  Thanks,

 Matt


PS: I'm using an arm64 Mac, so I don't have access to valgrind.

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/



Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-22 Thread Sajid Ali Syed via petsc-users
Hi Barry,

Thanks a lot for fixing this issue. I ran the same problem on a Linux machine 
and have the following trace for the same crash (with ASAN turned on for both 
PETSc, on the latest commit of the branch, and the application): 
https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940

The trace seems to indicate a couple of buffer overflows, one of which causes 
the crash. I'm not sure what causes them.

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Barry Smith 
Sent: Wednesday, February 15, 2023 2:01 PM
To: Sajid Ali Syed 
Cc: petsc-users@mcs.anl.gov 
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode


https://gitlab.com/petsc/petsc/-/merge_requests/6075 should fix the possible recursive error condition Matt pointed out


On Feb 9, 2023, at 6:24 PM, Matthew Knepley  wrote:

On Thu, Feb 9, 2023 at 6:05 PM Sajid Ali Syed via petsc-users <petsc-users@mcs.anl.gov> wrote:

I added “-malloc_debug” in a .petscrc file and ran it again. The backtrace from 
lldb is in the attached file. The crash now seems to be at:

Process 32660 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f603fb8)
    frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, fd=0x, format=0x) at mprint.c:601
   598      `PetscViewerASCIISynchronizedPrintf()`, `PetscSynchronizedFlush()`
   599    @*/
   600    PetscErrorCode PetscFPrintf(MPI_Comm comm, FILE *fd, const char format[], ...)
-> 601    {
   602      PetscMPIInt rank;
   603
   604      PetscFunctionBegin;
(lldb) frame info
frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, fd=0x, format=0x) at mprint.c:601
(lldb)


The trace seems to indicate some sort of infinite loop causing an overflow.

Yes, I have also seen this. What happens is that we have a memory error. The 
error is reported inside PetscMallocValidate()
using PetscErrorPrintf, which eventually calls PetscCallMPI, which calls 
PetscMallocValidate again, which fails. We need to
remove all error checking from the prints inside Validate.

  Thanks,

 Matt


PS: I'm using an arm64 Mac, so I don't have access to valgrind.

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/



Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-09 Thread Sajid Ali Syed via petsc-users
I’ve also printed out the head struct in the debugger, and it looks like this:

(lldb) print (TRSPACE)*head
(TRSPACE) $7 = {
  size = 16
  rsize = 16
  id = 12063
  lineno = 217
  filename = 0x0001167fd865 
"/Users/sasyed/Documents/packages/petsc/src/sys/dll/reg.c"
  functionname = 0x0001167fde78 "PetscFunctionListDLAllPop_Private"
  classid = -253701943
  stack = {
function = {
  [0] = 0x00010189e2da "apply_bunch"
  [1] = 0x00010189e2da "apply_bunch"
  [2] = 0x00010189e2da "apply_bunch"
  [3] = 0x00010189e2da "apply_bunch"
  [4] = 0x00010189e2da "apply_bunch"
  [5] = 0x00010189e2da "apply_bunch"
  [6] = 0x00010189e2da "apply_bunch"
  [7] = 0x00010189e2da "apply_bunch"
  [8] = 0x00010189e2da "apply_bunch"
  [9] = 0x00010189e2da "apply_bunch"
  [10] = 0x00010189e2da "apply_bunch"
  [11] = 0x00010189e2da "apply_bunch"
  [12] = 0x00010189e2da "apply_bunch"
  [13] = 0x00010189e2da "apply_bunch"
  [14] = 0x00010189e2da "apply_bunch"
  [15] = 0x00010189e2da "apply_bunch"
  [16] = 0x00010189e2da "apply_bunch"
  [17] = 0x00010189e2da "apply_bunch"
  [18] = 0x00010189e2da "apply_bunch"
  [19] = 0x00010189e2da "apply_bunch"
  [20] = 0x00010189e2da "apply_bunch"
  [21] = 0x00010189e2da "apply_bunch"
  [22] = 0x00010189e2da "apply_bunch"
  [23] = 0x00010189e2da "apply_bunch"
  [24] = 0x00010189e2da "apply_bunch"
  [25] = 0x00010189e2da "apply_bunch"
  [26] = 0x00010189e2da "apply_bunch"
  [27] = 0x00010189e2da "apply_bunch"
  [28] = 0x00010189e2da "apply_bunch"
  [29] = 0x00010189e2da "apply_bunch"
  [30] = 0x00010189e2da "apply_bunch"
  [31] = 0x00010189e2da "apply_bunch"
  [32] = 0x00010189e2da "apply_bunch"
  [33] = 0x00010189e2da "apply_bunch"
  [34] = 0x00010189e2da "apply_bunch"
  [35] = 0x00010189e2da "apply_bunch"
  [36] = 0x00010189e2da "apply_bunch"
  [37] = 0x00010189e2da "apply_bunch"
  [38] = 0x00010189e2da "apply_bunch"
  [39] = 0x00010189e2da "apply_bunch"
  [40] = 0x00010189e2da "apply_bunch"
  [41] = 0x00010189e2da "apply_bunch"
  [42] = 0x00010189e2da "apply_bunch"
  [43] = 0x00010189e2da "apply_bunch"
  [44] = 0x00010189e2da "apply_bunch"
  [45] = 0x00010189e2da "apply_bunch"
  [46] = 0x00010189ebba "compute_mat"
  [47] = 0x00010189f0c3 "solve"
  [48] = 0x0001168b834c "KSPSolve"
  [49] = 0x0001168b89f7 "KSPSolve_Private"
  [50] = 0x0001168b395b "KSPSolve_GMRES"
  [51] = 0x0001168b37f8 "KSPGMRESCycle"
  [52] = 0x0001168ae4a7 "KSP_PCApplyBAorAB"
  [53] = 0x000116891b38 "PCApplyBAorAB"
  [54] = 0x0001168917ec "PCApply"
  [55] = 0x0001168a5337 "PCApply_MG"
  [56] = 0x0001168a5342 "PCApply_MG_Internal"
  [57] = 0x0001168a42e1 "PCMGMCycle_Private"
  [58] = 0x0001168b834c "KSPSolve"
  [59] = 0x0001168b89f7 "KSPSolve_Private"
  [60] = 0x00011682e396 "VecDestroy"
  [61] = 0x00011682d58e "VecDestroy_Seq"
  [62] = 0x0001168093fe "PetscObjectComposeFunction_Private"
  [63] = 0x000116809338 "PetscObjectComposeFunction_Petsc"
}
file = {
  [0] = 0x00010189e27f 
"/Users/sasyed/Documents/packages/synergia2/src/synergia/collective/space_charge_3d_fd.cc"
  [1] = 0x00010189e27f 
"/Users/sasyed/Documents/packages/synergia2/src/synergia/collective/space_charge_3d_fd.cc"
  [2] = 0x00010189e27f 
"/Users/sasyed/Documents/packages/synergia2/src/synergia/collective/space_charge_3d_fd.cc"
  [3] = 0x00010189e27f 
"/Users/sasyed/Documents/packages/synergia2/src/synergia/collective/space_charge_3d_fd.cc"
  [4] = 0x00010189e27f 
"/Users/sasyed/Documents/packages/synergia2/src/synergia/collective/space_charge_3d_fd.cc"
  [5] = 0x00010189e27f 
"/Users/sasyed/Documents/packages/synergia2/src/synergia/collective/space_charge_3d_fd.cc"
  [6] = 0x00010189e27f 
"/Users/sasyed/Documents/packages/synergia2/src/synergia/collective/space_charge_3d_fd.cc"
  [7] = 0x00010189e27f 
"/Users/sasyed/Documents/packages/synergia2/src/synergia/collective/space_charge_3d_fd.cc"
  [8] = 0x00010189e27f 
"/Users/sasyed/Documents/packages/synergia2/src/synergia/collective/space_charge_3d_fd.cc"
  [9] = 0x00010189e27f 
"/Users/sasyed/Documents/packages/synergia2/src/synergia/collective/space_charge_3d_fd.cc"
  [10] = 0x00010189e27f 
"/Users/sasyed/Documents/packages/synergia2/src/synergia/collective/space_charge_3d_fd.cc"
  [11] = 0x00010189e27f 
"/Users/sasyed/Documents/packages/synergia2/src/synergia/collective/space_charge_3d_fd.cc"
  [12] = 0x00010189e27f 
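
The dozens of identical apply_bunch entries above are frames that were pushed onto PETSc's debug stack and never popped, consistent with the unmatched PetscFunctionBegin/PetscFunctionReturn pairing discussed elsewhere in this thread. For reference, a user-level routine participates in that stack as in the following sketch (routine and argument names assumed):

    PetscErrorCode apply_bunch_step(KSP ksp, Vec rho, Vec phi)
    {
      PetscFunctionBeginUser;             // pushes this frame onto the debug stack
      PetscCall(KSPSolve(ksp, rho, phi)); // propagates any error upward
      PetscFunctionReturn(0);             // pops the frame on every successful exit
    }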

Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-09 Thread Sajid Ali Syed via petsc-users
Hi Barry,

The lack of line numbers is due to the fact that this build of PETSc was done 
via spack, which installs it in a temporary directory before moving it to the 
final location.

I have removed that build and installed PETSc locally (albeit with a simpler 
configuration) and see the same bug. Logs for this configuration and the error 
trace with this build are attached with this email.

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Barry Smith 
Sent: Thursday, February 9, 2023 12:02 PM
To: Sajid Ali Syed 
Cc: petsc-users@mcs.anl.gov 
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode


  Hmm, looks like your build may be funny? It is not in debug mode

frame #2: 0x00010eda20c8 libpetsc.3.018.dylib`PetscHeaderDestroy_Private + 
1436
frame #3: 0x00010f10176c libpetsc.3.018.dylib`VecDestroy + 808
frame #4: 0x000110199f34 libpetsc.3.018.dylib`KSPSolve_Private + 512

In debug mode it would show the line numbers where the crash occurred and 
help us determine the problem. I do note the -g being used by the compilers, so 
I cannot explain offhand why it does not display the debug information.

  Barry


On Feb 9, 2023, at 12:42 PM, Sajid Ali Syed via petsc-users wrote:


Hi PETSc-developers,

In our application we call KSP_Solve as part of a step to propagate a beam 
through a lattice. I am observing a crash within KSP_Solve for an application 
only after the 43rd call to KSP_Solve when building the application and PETSc 
in debug mode, full logs for which are attached with this email (1 MPI rank and 
4 OMP threads were used, but this crash occurs with multiple MPI ranks as 
well). I am also including the last few lines of the configuration for this build. 
This crash does not occur when building the application and PETSc in release 
mode.

Could someone tell me what causes this crash and if anything can be done to 
prevent it? Thanks in advance.

The configuration of this solver is here: 
https://github.com/fnalacceleratormodeling/synergia2/blob/sajid/features/openpmd_basic_integration/src/synergia/collective/space_charge_3d_fd_utils.cc#L273-L292
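
The stack traces in this thread show GMRES with a multigrid preconditioner; a minimal setup consistent with that is sketched below (the authoritative configuration is at the link above; A, b, and x are assumed):

    #include <petscksp.h>

    PetscErrorCode setup_and_solve(Mat A, Vec b, Vec x)
    {
      KSP ksp;
      PC  pc;
      PetscFunctionBeginUser;
      PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
      PetscCall(KSPSetOperators(ksp, A, A));
      PetscCall(KSPSetType(ksp, KSPGMRES));
      PetscCall(KSPGetPC(ksp, &pc));
      PetscCall(PCSetType(pc, PCMG));
      PetscCall(KSPSetFromOptions(ksp)); // allow -ksp_*/-pc_* overrides
      PetscCall(KSPSolve(ksp, b, x));
      PetscCall(KSPDestroy(&ksp));
      PetscFunctionReturn(0);
    }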

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io





configure_log_tail_local_install
Description: configure_log_tail_local_install


ksp_crash_log_local_install
Description: ksp_crash_log_local_install


Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-09 Thread Sajid Ali Syed via petsc-users
The configuration log is attached with this email.





configure_log_tail
Description: configure_log_tail


[petsc-users] KSP_Solve crashes in debug mode

2023-02-09 Thread Sajid Ali Syed via petsc-users
Hi PETSc-developers,

In our application we call KSP_Solve as part of a step to propagate a beam 
through a lattice. I am observing a crash within KSP_Solve for an application 
only after the 43rd call to KSP_Solve when building the application and PETSc 
in debug mode, full logs for which are attached with this email (1 MPI rank and 
4 OMP threads were used, but this crash occurs with multiple MPI ranks as 
well). I am also including the last few lines of the configuration for this build. 
This crash does not occur when building the application and PETSc in release 
mode.

Could someone tell me what causes this crash and if anything can be done to 
prevent it? Thanks in advance.

The configuration of this solver is here : 
https://github.com/fnalacceleratormodeling/synergia2/blob/sajid/features/openpmd_basic_integration/src/synergia/collective/space_charge_3d_fd_utils.cc#L273-L292

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io



ksp_crash_log
Description: ksp_crash_log


Re: [petsc-users] Regarding the status of VecSetValues(Blocked) for GPU vectors

2022-03-18 Thread Sajid Ali Syed
Hi Matt/Mark,

I'm working on a Poisson solver for a distributed PIC code, where the particles 
are distributed over MPI ranks rather than the grid. Prior to the solve, all 
particles are deposited onto a (DMDA) grid.

The current prototype I have is that each rank holds a full-size DMDA vector 
and particles on that rank are deposited into it. Then, the data from all the 
local vectors is combined into multiple distributed DMDA vectors via 
VecScatters, and this is followed by solving the Poisson equation. The need to 
have multiple subcomms, each solving the same equation, is due to the fact that 
the grid size is too small to use all the MPI ranks (beyond the strong scaling 
limit). The solution is then scattered back to each MPI rank via VecScatters.

This first local-to-(multi)global transfer required the use of multiple 
VecScatters, as there is no one-to-multiple scatter capability in SF. This works 
and is already giving a large speedup over the allreduce baseline currently in 
use (which transfers more data than is necessary).

I was wondering if, within each subcommunicator, I could directly write to the 
DMDA vector via VecSetValues and PETSc would take care of stashing the values 
on the GPU until I call VecAssemblyBegin. Since this would be from within a 
Kokkos parallel_for operation, there would be multiple (probably ~1e3) 
simultaneous writes that the stashing mechanism would have to support. 
Currently, we use Kokkos-ScatterView to do this.
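
A sketch of the local-to-(multi)global transfer described above (the index sets and vectors are assumed to be built elsewhere):

    // Combine one rank's full-size deposition vector into the distributed
    // DMDA vector of its subcommunicator, summing contributions
    PetscErrorCode combine_local_into_global(Vec v_local_full, IS is_from, Vec v_global, IS is_to)
    {
      VecScatter sc;
      PetscFunctionBeginUser;
      PetscCall(VecScatterCreate(v_local_full, is_from, v_global, is_to, &sc));
      PetscCall(VecScatterBegin(sc, v_local_full, v_global, ADD_VALUES, SCATTER_FORWARD));
      PetscCall(VecScatterEnd(sc, v_local_full, v_global, ADD_VALUES, SCATTER_FORWARD));
      PetscCall(VecScatterDestroy(&sc));
      PetscFunctionReturn(0);
    }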

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Matthew Knepley 
Sent: Thursday, March 17, 2022 7:25 PM
To: Mark Adams 
Cc: Sajid Ali Syed ; petsc-users@mcs.anl.gov 

Subject: Re: [petsc-users] Regarding the status of VecSetValues(Blocked) for 
GPU vectors

On Thu, Mar 17, 2022 at 8:19 PM Mark Adams <mfad...@lbl.gov> wrote:
LocalToGlobal is a DM thing.
Sajid, do you use DM?
If you need to add off-processor entries, then DM could give you a local vector, 
as Matt said, that you can add to for off-processor values, and then you could 
use the CPU communication in DM.

It would be GPU communication, not CPU.

   Matt

On Thu, Mar 17, 2022 at 7:19 PM Matthew Knepley <knep...@gmail.com> wrote:
On Thu, Mar 17, 2022 at 4:46 PM Sajid Ali Syed <sas...@fnal.gov> wrote:
Hi PETSc-developers,

Is it possible to use VecSetValues with distributed-memory CUDA & Kokkos 
vectors from the device, i.e. can I call VecSetValues with GPU memory pointers 
and expect PETSc to figure out how to stash the values on the device until I 
call VecAssemblyBegin (at which point PETSc could use GPU-aware MPI to populate 
off-process values)?

If this is not currently supported, is supporting this on the roadmap? Thanks 
in advance!

VecSetValues() will fall back to the CPU vector, so I do not think this will 
work on device.

Usually, our assembly computes all values and puts them in a "local" vector, 
which you can access explicitly as Mark said. Then
we call LocalToGlobal() to communicate the values, which does work directly on 
device using specialized code in VecScatter/PetscSF.
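
A minimal sketch of that pattern (the DM, the global vector, and the deposition step are assumptions):

    PetscErrorCode deposit_and_communicate(DM dm, Vec gv)
    {
      Vec lv;
      PetscFunctionBeginUser;
      PetscCall(DMGetLocalVector(dm, &lv));
      // ... deposit particle data into lv here, e.g. from a Kokkos parallel_for ...
      PetscCall(DMLocalToGlobalBegin(dm, lv, ADD_VALUES, gv)); // on-device when supported
      PetscCall(DMLocalToGlobalEnd(dm, lv, ADD_VALUES, gv));
      PetscCall(DMRestoreLocalVector(dm, &lv));
      PetscFunctionReturn(0);
    }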

What are you trying to do?

  Thanks,

  Matt

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io



--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


[petsc-users] Regarding the status of VecSetValues(Blocked) for GPU vectors

2022-03-17 Thread Sajid Ali Syed
Hi PETSc-developers,

Is it possible to use VecSetValues with distributed-memory CUDA & Kokkos 
vectors from the device, i.e. can I call VecSetValues with GPU memory pointers 
and expect PETSc to figure out how to stash the values on the device until I 
call VecAssemblyBegin (at which point PETSc could use GPU-aware MPI to populate 
off-process values)?

If this is not currently supported, is supporting this on the roadmap? Thanks 
in advance!

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io



Re: [petsc-users] GAMG crash during setup when using multiple GPUs

2022-02-11 Thread Sajid Ali Syed
Hi Mark,

Thanks for the information.

@Junchao: Given that there are known issues with GPU-aware MPI, it might be 
best to wait until there is an updated version of cray-mpich (which hopefully 
contains the relevant fixes).

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Mark Adams 
Sent: Thursday, February 10, 2022 8:47 PM
To: Junchao Zhang 
Cc: Sajid Ali Syed ; petsc-users@mcs.anl.gov 

Subject: Re: [petsc-users] GAMG crash during setup when using multiple GPUs

Perlmutter has problems with GPU-aware MPI.
This is being actively worked on at NERSc.

Mark

On Thu, Feb 10, 2022 at 9:22 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
Hi, Sajid Ali,
  I have no clue. I have access to Perlmutter; I am thinking about how to debug this.
  If your app is open-source and easy to build, then I can build and debug it. 
Otherwise, suppose you build and install petsc (only with the options needed by 
your app) to a shared directory, and I can access your executable (which uses 
RPATH for libraries); then maybe I can debug it (I only need to install my own 
petsc to the shared directory).

--Junchao Zhang


On Thu, Feb 10, 2022 at 6:04 PM Sajid Ali Syed <sas...@fnal.gov> wrote:
Hi Junchao,

With "-use_gpu_aware_mpi 0" there is no error. I'm attaching the log for this 
case with this email.

I also ran with GPU-aware MPI to see if I could reproduce the error, and got 
the error but from a different location. This logfile is also attached.

This was using the newest cray-mpich on NERSC-perlmutter (8.1.12). Let me know 
if I can share further information to help with debugging this.

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Junchao Zhang <junchao.zh...@gmail.com>
Sent: Thursday, February 10, 2022 1:43 PM
To: Sajid Ali Syed <sas...@fnal.gov>
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] GAMG crash during setup when using multiple GPUs

Also, try "-use_gpu_aware_mpi 0" to see if there is a difference.
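
For example (launcher and executable names assumed):

    srun -n 64 ./app -use_gpu_aware_mpi 0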

--Junchao Zhang


On Thu, Feb 10, 2022 at 1:40 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
Did it fail without GPU at 64 MPI ranks?

--Junchao Zhang


On Thu, Feb 10, 2022 at 1:22 PM Sajid Ali Syed <sas...@fnal.gov> wrote:

Hi PETSc-developers,

I’m seeing the following crash that occurs during the setup phase of the 
preconditioner when using multiple GPUs. The relevant error trace is shown 
below:

(GTL DEBUG: 26) cuIpcOpenMemHandle: resource already mapped, 
CUDA_ERROR_ALREADY_MAPPED, line no 272
[24]PETSC ERROR: - Error Message 
--
[24]PETSC ERROR: General MPI error
[24]PETSC ERROR: MPI error 1 Invalid buffer pointer
[24]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[24]PETSC ERROR: Petsc Development GIT revision: 
f351d5494b5462f62c419e00645ac2e477b88cae  GIT Date: 2022-02-08 15:08:19 +
...
[24]PETSC ERROR: #1 PetscSFLinkWaitRequests_MPI() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/impls/basic/sfmpi.c:54
[24]PETSC ERROR: #2 PetscSFLinkFinishCommunication() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/include/../src/vec/is/sf/impls/basic/sfpack.h:274
[24]PETSC ERROR: #3 PetscSFBcastEnd_Basic() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/impls/basic/sfbasic.c:218
[24]PETSC ERROR: #4 PetscSFBcastEnd() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/sf.c:1499
[24]PETSC ERROR: #5 VecScatterEnd_Internal() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/vscat.c:87
[24]PETSC ERROR: #6 VecScatterEnd() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/vscat.c:1366
[24]PETSC ERROR: #7 MatMult_MPIAIJCUSPARSE() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6k

[petsc-users] Sparse solvers for distributed GPU matrices/vectors arising from 3D poisson eq

2022-02-04 Thread Sajid Ali Syed
Hi PETSc-developers,

Could the linear solver table (at 
https://petsc.org/main/overview/linear_solve_table/) be updated with 
information regarding direct solvers that work on mpiaijkokkos/kokkos (or 
mpiaijcusparse/cuda) matrix/vector types?

The use case for this solver would be to repeatedly invert the same matrix, so 
any solver that can perform the SpTRSV phase entirely using GPU 
matrices/vectors would be helpful (even if the initial factorization is 
performed using CPU matrices/vectors with GPU offload); this functionality 
would be the distributed-memory counterpart to the current device-solve 
capabilities of the seqaijkokkos matrix type (provided by the Kokkos Kernels 
SpTRSV routines). The system arises from a (7-pt) finite difference 
discretization of the 3D Poisson equation with a mesh of 256x256x1024 (which 
will likely necessitate multiple GPUs) with Dirichlet boundary conditions.
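
For reference, a grid of that size would be created along the lines of this sketch (7-point star stencil; the parallel decomposition is left to PETSc):

    PetscErrorCode create_poisson_grid(DM *da)
    {
      PetscFunctionBeginUser;
      PetscCall(DMDACreate3d(PETSC_COMM_WORLD,
                             DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                             DMDA_STENCIL_STAR,
                             256, 256, 1024,                           // global grid
                             PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE, // ranks per axis
                             1, 1,                                     // dof, stencil width
                             NULL, NULL, NULL, da));
      PetscCall(DMSetFromOptions(*da));
      PetscCall(DMSetUp(*da));
      PetscFunctionReturn(0);
    }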

The recent article on PetscSF (arXiv:2102.13018) describes an asynchronous CG 
solver that works well on communication-bound multi-GPU systems. Is this solver 
available now, and can it be combined with GAMG/hypre preconditioning?


Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io