[petsc-users] How to link SuperLU_MT library in GCC_default.mk file?

2023-02-22 Thread Salman Ahmad
Dear All,

I compiled the "SuperLU_MT" library, got "libsuperlu_mt_PTHREAD.a", and copied
it to /usr/lib. I am using the following file to link against it, but it is not
working:

CC = gcc
CXX = g++
F77 = gfortran
LINKER  = ${CXX}

WARNINGS = -Wall -pedantic -Wextra -Weffc++ -Woverloaded-virtual \
           -Wfloat-equal -Wshadow \
           -Wredundant-decls -Winline -fmax-errors=1

CXXFLAGS += -ffast-math -O3 -march=native -std=c++17 ${WARNINGS}

LINKFLAGS   += -O2

#architecture
#CPU = -march=znver2
CXXFLAGS  += ${CPU}
LINKFLAGS += ${CPU}


ifeq ($(UBUNTU),1)
LINKFLAGS += -llapack -lblas
CXXFLAGS  += -DUBUNTU
else
# on  archlinux
LINKFLAGS += -llapack -lopenblas -lcblas
endif

SANITARY = -fsanitize=address -fsanitize=undefined -fsanitize=null -fsanitize=return \
           -fsanitize=bounds -fsanitize=alignment -fsanitize=float-divide-by-zero \
           -fsanitize=float-cast-overflow -fsanitize=bool -fsanitize=enum -fsanitize=vptr

# SuperLU_MT
CXXFLAGS += -L/usr/lib/lsuperlu_mt_PTHREAD
LINKFLAGS += -L/usr/lib/lsuperlu_mt_PTHREAD
#
SUPERLU_INC= -I/usr/include/superlu -I/usr/include/superlu-dist
CXXFLAGS  += ${SUPERLU_INC}
LINKFLAGS +=-lsuperlu
# OpenMP
CXXFLAGS += -fopenmp
LINKFLAGS += -fopenmp
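
What I am trying to achieve is, roughly, the following split between the include
path, the library search path (-L) and the library name (-l); the directory names
and the extra -lpthread here are my assumption:

# Sketch of the intended SuperLU_MT section (paths assumed, adjust to the actual install)
SUPERLU_MT_INC = -I/usr/include/superlu_mt
SUPERLU_MT_LIB = -L/usr/lib -lsuperlu_mt_PTHREAD -lpthread

CXXFLAGS  += ${SUPERLU_MT_INC}
LINKFLAGS += ${SUPERLU_MT_LIB}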

Any suggestions?
Best Regards,
Salman Ahmad


Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-22 Thread Barry Smith

  Hmm, there could be a bug in our handling of the stack when it reaches the 
maximum. It is supposed to just stop collecting additional levels at that point, 
but that path has likely not been tested since a lot of refactoring.

   What are you doing to have so many stack frames? 

> On Feb 22, 2023, at 6:32 PM, Sajid Ali Syed  wrote:
> 
> Hi Matt, 
> 
> Adding `-checkstack` does not prevent the crash, both on my laptop and on the 
> cluster. 
> 
> What does prevent the crash (on my laptop at least) is changing 
> `PETSCSTACKSIZE` from 64 to 256 here : 
> https://github.com/petsc/petsc/blob/main/include/petscerror.h#L1153
> 
> 
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io 
> From: Matthew Knepley <knep...@gmail.com>
> Sent: Wednesday, February 22, 2023 5:23 PM
> To: Sajid Ali Syed <sas...@fnal.gov>
> Cc: Barry Smith <bsm...@petsc.dev>; petsc-users@mcs.anl.gov
> Subject: Re: [petsc-users] KSP_Solve crashes in debug mode
>  
> On Wed, Feb 22, 2023 at 6:18 PM Sajid Ali Syed via petsc-users 
> <petsc-users@mcs.anl.gov> wrote:
> One thing to note in relation to the trace attached in the previous email is 
> that there are no warnings until the 36th call to KSP_Solve. The first error 
> (as indicated by ASAN) occurs somewhere before the 40th call to KSP_Solve 
> (part of what the application marks as turn 10 of the propagator). The crash 
> finally occurs on the 43rd call to KSP_solve.
> 
> Looking at the trace, it appears that stack handling is messed up and 
> eventually it causes the crash. This can happen when
> PetscFunctionBegin is not matched up with PetscFunctionReturn. Can you try 
> running this with
> 
>   -checkstack
> 
>   Thanks,
> 
>  Matt
>  
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io 
> 
> From: Sajid Ali Syed <sas...@fnal.gov>
> Sent: Wednesday, February 22, 2023 5:11 PM
> To: Barry Smith <bsm...@petsc.dev>
> Cc: petsc-users@mcs.anl.gov
> Subject: Re: [petsc-users] KSP_Solve crashes in debug mode
>  
> Hi Barry, 
> 
> Thanks a lot for fixing this issue. I ran the same problem on a linux machine 
> and have the following trace for the same crash (with ASAN turned on for both 
> PETSc (on the latest commit of the branch) and the application) : 
> https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940 
> 
> 
> The trace seems to indicate a couple of buffer overflows, one of which causes 
> the crash. I'm not sure as to what causes them. 
> 
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io 
> 
> From: Barry Smith <bsm...@petsc.dev>
> Sent: Wednesday, February 15, 2023 2:01 PM
> To: Sajid Ali Syed <sas...@fnal.gov>
> Cc: petsc-users@mcs.anl.gov
> Subject: Re: [petsc-users] KSP_Solve crashes in debug mode
>  
> 
> https://gitlab.com/petsc/petsc/-/merge_requests/6075
>  should fix the possible recursive error condition Matt pointed out
> 
> 
>> On Feb 9, 2023, at 6:24 PM, Matthew Knepley wrote:
>> 
>> On Thu, Feb 9, 2023 at 6:05 PM Sajid Ali Syed via petsc-users 
>> <petsc-users@mcs.anl.gov> wrote:
>> I added “-malloc_debug” in a .petscrc file and ran it again. The backtrace 
>> from lldb is in the attached file. The crash now seems to be at:
>> 
>> Process 32660 stopped* thread #1, queue = 'com.apple.main-thread', stop 
>> reason = EXC_BAD_ACCESS (code=2, address=0x16f603fb8)
>>   

Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-22 Thread Sajid Ali Syed via petsc-users
Via a checkpoint in `PetscOptionsCheckInitial_Private`, I can confirm that 
`checkstack` is set to `PETSC_TRUE`, yet this produces no (additional) 
information about erroneous stack handling.


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Sajid Ali Syed 
Sent: Wednesday, February 22, 2023 6:34 PM
To: Matthew Knepley 
Cc: Barry Smith ; petsc-users@mcs.anl.gov 

Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

Hi Matt,

This is a trace from the same crash, but with `-checkstack` included in 
.petscrc​ : https://gist.github.com/s-sajid-ali/455b3982d47a31bff9e7ee211dd43991


I don't see any additional information regarding the possible cause.


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Matthew Knepley 
Sent: Wednesday, February 22, 2023 6:28 PM
To: Sajid Ali Syed 
Cc: Barry Smith ; petsc-users@mcs.anl.gov 

Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

> On Wed, Feb 22, 2023 at 6:32 PM Sajid Ali Syed 
> <sas...@fnal.gov> wrote:
Hi Matt,

Adding `-checkstack` does not prevent the crash, both on my laptop and on the 
cluster.

It will not prevent a crash. The output is intended to show us where the stack 
problem originates. Can you send the output?

  Thanks,

Matt

What does prevent the crash (on my laptop at least) is changing 
`PETSCSTACKSIZE` from 64 to 256 here : 
https://github.com/petsc/petsc/blob/main/include/petscerror.h#L1153


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


> From: Matthew Knepley <knep...@gmail.com>
> Sent: Wednesday, February 22, 2023 5:23 PM
> To: Sajid Ali Syed <sas...@fnal.gov>
> Cc: Barry Smith <bsm...@petsc.dev>; petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

> On Wed, Feb 22, 2023 at 6:18 PM Sajid Ali Syed via petsc-users 
> <petsc-users@mcs.anl.gov> wrote:
One thing to note in relation to the trace attached in the previous email is 
that there are no warnings until the 36th call to KSP_Solve. The first error 
(as indicated by ASAN) occurs somewhere before the 40th call to KSP_Solve (part 
of what the application marks as turn 10 of the propagator). The crash finally 
occurs on the 43rd call to KSP_solve.

Looking at the trace, it appears that stack handling is messed up and 
eventually it causes the crash. This can happen when
PetscFunctionBegin is not matched up with PetscFunctionReturn. Can you try 
running this with

  -checkstack

  Thanks,

 Matt

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


> From: Sajid Ali Syed <sas...@fnal.gov>
> Sent: Wednesday, February 22, 2023 5:11 PM
> To: Barry Smith <bsm...@petsc.dev>
> Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

Hi Barry,

Thanks a lot for fixing this issue. I ran the same problem on a linux machine 
and have the following trace for the same crash (with ASAN turned on for both 
PETSc (on the latest commit of the branch) and the application) : 
https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940

The trace seems to indicate a couple of buffer overflows, one of which causes 
the crash. I'm not sure as to what causes them.

Thank You,
Sajid Ali (he/him) | Research 

Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-22 Thread Sajid Ali Syed via petsc-users
Hi Matt,

This is a trace from the same crash, but with `-checkstack` included in 
.petscrc​ : https://gist.github.com/s-sajid-ali/455b3982d47a31bff9e7ee211dd43991


I don't see any additional information regarding the possible cause.


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Matthew Knepley 
Sent: Wednesday, February 22, 2023 6:28 PM
To: Sajid Ali Syed 
Cc: Barry Smith ; petsc-users@mcs.anl.gov 

Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

On Wed, Feb 22, 2023 at 6:32 PM Sajid Ali Syed 
mailto:sas...@fnal.gov>> wrote:
Hi Matt,

Adding `-checkstack` does not prevent the crash, both on my laptop and on the 
cluster.

It will not prevent a crash. The output is intended to show us where the stack 
problem originates. Can you send the output?

  Thanks,

Matt

What does prevent the crash (on my laptop at least) is changing 
`PETSCSTACKSIZE` from 64 to 256 here : 
https://github.com/petsc/petsc/blob/main/include/petscerror.h#L1153


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Matthew Knepley mailto:knep...@gmail.com>>
Sent: Wednesday, February 22, 2023 5:23 PM
To: Sajid Ali Syed mailto:sas...@fnal.gov>>
Cc: Barry Smith mailto:bsm...@petsc.dev>>; 
petsc-users@mcs.anl.gov 
mailto:petsc-users@mcs.anl.gov>>
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

On Wed, Feb 22, 2023 at 6:18 PM Sajid Ali Syed via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
One thing to note in relation to the trace attached in the previous email is 
that there are no warnings until the 36th call to KSP_Solve. The first error 
(as indicated by ASAN) occurs somewhere before the 40th call to KSP_Solve (part 
of what the application marks as turn 10 of the propagator). The crash finally 
occurs on the 43rd call to KSP_solve.

Looking at the trace, it appears that stack handling is messed up and 
eventually it causes the crash. This can happen when
PetscFunctionBegin is not matched up with PetscFunctionReturn. Can you try 
running this with

  -checkstack

  Thanks,

 Matt

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Sajid Ali Syed mailto:sas...@fnal.gov>>
Sent: Wednesday, February 22, 2023 5:11 PM
To: Barry Smith mailto:bsm...@petsc.dev>>
Cc: petsc-users@mcs.anl.gov 
mailto:petsc-users@mcs.anl.gov>>
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

Hi Barry,

Thanks a lot for fixing this issue. I ran the same problem on a linux machine 
and have the following trace for the same crash (with ASAN turned on for both 
PETSc (on the latest commit of the branch) and the application) : 
https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940

The trace seems to indicate a couple of buffer overflows, one of which causes 
the crash. I'm not sure as to what causes them.

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Barry Smith mailto:bsm...@petsc.dev>>
Sent: Wednesday, February 15, 2023 2:01 PM
To: Sajid Ali Syed mailto:sas...@fnal.gov>>
Cc: petsc-users@mcs.anl.gov 

Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-22 Thread Matthew Knepley
On Wed, Feb 22, 2023 at 6:32 PM Sajid Ali Syed  wrote:

> Hi Matt,
>
> Adding `-checkstack` does not prevent the crash, both on my laptop and on
> the cluster.
>

It will not prevent a crash. The output is intended to show us where the
stack problem originates. Can you send the output?

  Thanks,

Matt


> What does prevent the crash (on my laptop at least) is changing
> `PETSCSTACKSIZE` from 64 to 256 here :
> https://github.com/petsc/petsc/blob/main/include/petscerror.h#L1153
>
>
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io
>
> --
> *From:* Matthew Knepley 
> *Sent:* Wednesday, February 22, 2023 5:23 PM
> *To:* Sajid Ali Syed 
> *Cc:* Barry Smith ; petsc-users@mcs.anl.gov <
> petsc-users@mcs.anl.gov>
> *Subject:* Re: [petsc-users] KSP_Solve crashes in debug mode
>
> On Wed, Feb 22, 2023 at 6:18 PM Sajid Ali Syed via petsc-users <
> petsc-users@mcs.anl.gov> wrote:
>
> One thing to note in relation to the trace attached in the previous email
> is that there are no warnings until the 36th call to KSP_Solve. The first
> error (as indicated by ASAN) occurs somewhere before the 40th call to
> KSP_Solve (part of what the application marks as turn 10 of the
> propagator). The crash finally occurs on the 43rd call to KSP_solve.
>
>
> Looking at the trace, it appears that stack handling is messed up and
> eventually it causes the crash. This can happen when
> PetscFunctionBegin is not matched up with PetscFunctionReturn. Can you try
> running this with
>
>   -checkstack
>
>   Thanks,
>
>  Matt
>
>
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io
> 
>
> --
> *From:* Sajid Ali Syed 
> *Sent:* Wednesday, February 22, 2023 5:11 PM
> *To:* Barry Smith 
> *Cc:* petsc-users@mcs.anl.gov 
> *Subject:* Re: [petsc-users] KSP_Solve crashes in debug mode
>
> Hi Barry,
>
> Thanks a lot for fixing this issue. I ran the same problem on a linux
> machine and have the following trace for the same crash (with ASAN turned
> on for both PETSc (on the latest commit of the branch) and the application)
> : https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940
> 
>
> The trace seems to indicate a couple of buffer overflows, one of which
> causes the crash. I'm not sure as to what causes them.
>
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io
> 
>
> --
> *From:* Barry Smith 
> *Sent:* Wednesday, February 15, 2023 2:01 PM
> *To:* Sajid Ali Syed 
> *Cc:* petsc-users@mcs.anl.gov 
> *Subject:* Re: [petsc-users] KSP_Solve crashes in debug mode
>
>
> https://gitlab.com/petsc/petsc/-/merge_requests/6075
> 
>  should
> fix the possible recursive error condition Matt pointed out
>
>
> On Feb 9, 2023, at 6:24 PM, Matthew Knepley  wrote:
>
> On Thu, Feb 9, 2023 at 6:05 PM Sajid Ali Syed via petsc-users <
> petsc-users@mcs.anl.gov> wrote:
>
> I added “-malloc_debug” in a .petscrc file and ran it again. The backtrace
> from lldb is in the attached file. The crash now seems to be at:
>
> Process 32660 stopped* thread #1, queue = 'com.apple.main-thread', stop 
> reason = EXC_BAD_ACCESS (code=2, address=0x16f603fb8)
> frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, 
> fd=0x, format=0x) at mprint.c:601
>598   `PetscViewerASCIISynchronizedPrintf()`, 
> `PetscSynchronizedFlush()`
>599  @*/
>600  PetscErrorCode PetscFPrintf(MPI_Comm comm, FILE *fd, const char 
> format[], ...)
> -> 601  {
>602   PetscMPIInt rank;
>603
>604   PetscFunctionBegin;
> (lldb) frame info
> 

Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-22 Thread Sajid Ali Syed via petsc-users
Hi Matt,

Adding `-checkstack` does not prevent the crash, both on my laptop and on the 
cluster.

What does prevent the crash (on my laptop at least) is changing 
`PETSCSTACKSIZE` from 64 to 256 here : 
https://github.com/petsc/petsc/blob/main/include/petscerror.h#L1153
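
For context, the change amounts to bumping the fixed stack-depth constant in that
header; roughly (paraphrased and abridged, not copied verbatim from the PETSc source):

/* include/petscerror.h (sketch, fields abridged): PETSCSTACKSIZE sizes the
   fixed arrays that PetscFunctionBegin pushes onto, so deep call chains can
   exhaust them. */
#define PETSCSTACKSIZE 256 /* bumped from the default of 64 */

typedef struct {
  const char *function[PETSCSTACKSIZE]; /* active PETSc function names */
  const char *file[PETSCSTACKSIZE];     /* corresponding source files */
  int         line[PETSCSTACKSIZE];     /* corresponding line numbers */
  int         currentsize;              /* current stack depth */
  /* ... further bookkeeping fields omitted ... */
} PetscStack;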


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Matthew Knepley 
Sent: Wednesday, February 22, 2023 5:23 PM
To: Sajid Ali Syed 
Cc: Barry Smith ; petsc-users@mcs.anl.gov 

Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

On Wed, Feb 22, 2023 at 6:18 PM Sajid Ali Syed via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
One thing to note in relation to the trace attached in the previous email is 
that there are no warnings until the 36th call to KSP_Solve. The first error 
(as indicated by ASAN) occurs somewhere before the 40th call to KSP_Solve (part 
of what the application marks as turn 10 of the propagator). The crash finally 
occurs on the 43rd call to KSP_solve.

Looking at the trace, it appears that stack handling is messed up and 
eventually it causes the crash. This can happen when
PetscFunctionBegin is not matched up with PetscFunctionReturn. Can you try 
running this with

  -checkstack

  Thanks,

 Matt

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Sajid Ali Syed mailto:sas...@fnal.gov>>
Sent: Wednesday, February 22, 2023 5:11 PM
To: Barry Smith mailto:bsm...@petsc.dev>>
Cc: petsc-users@mcs.anl.gov 
mailto:petsc-users@mcs.anl.gov>>
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

Hi Barry,

Thanks a lot for fixing this issue. I ran the same problem on a linux machine 
and have the following trace for the same crash (with ASAN turned on for both 
PETSc (on the latest commit of the branch) and the application) : 
https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940

The trace seems to indicate a couple of buffer overflows, one of which causes 
the crash. I'm not sure as to what causes them.

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Barry Smith mailto:bsm...@petsc.dev>>
Sent: Wednesday, February 15, 2023 2:01 PM
To: Sajid Ali Syed mailto:sas...@fnal.gov>>
Cc: petsc-users@mcs.anl.gov 
mailto:petsc-users@mcs.anl.gov>>
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode


https://gitlab.com/petsc/petsc/-/merge_requests/6075
 should fix the possible recursive error condition Matt pointed out


On Feb 9, 2023, at 6:24 PM, Matthew Knepley 
mailto:knep...@gmail.com>> wrote:

On Thu, Feb 9, 2023 at 6:05 PM Sajid Ali Syed via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:

I added “-malloc_debug” in a .petscrc file and ran it again. The backtrace from 
lldb is in the attached file. The crash now seems to be at:

Process 32660 stopped* thread #1, queue = 'com.apple.main-thread', stop reason 
= EXC_BAD_ACCESS (code=2, address=0x16f603fb8)
frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, 
fd=0x, format=0x) at mprint.c:601
   598   `PetscViewerASCIISynchronizedPrintf()`, 
`PetscSynchronizedFlush()`
   599  @*/
   600  PetscErrorCode PetscFPrintf(MPI_Comm comm, FILE *fd, const char 
format[], ...)
-> 601  {
   602   PetscMPIInt rank;
   603  
   604   PetscFunctionBegin;
(lldb) frame info
frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, 
fd=0x, format=0x) at mprint.c:601
(lldb)



Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-22 Thread Matthew Knepley
On Wed, Feb 22, 2023 at 6:18 PM Sajid Ali Syed via petsc-users <
petsc-users@mcs.anl.gov> wrote:

> One thing to note in relation to the trace attached in the previous email
> is that there are no warnings until the 36th call to KSP_Solve. The first
> error (as indicated by ASAN) occurs somewhere before the 40th call to
> KSP_Solve (part of what the application marks as turn 10 of the
> propagator). The crash finally occurs on the 43rd call to KSP_solve.
>

Looking at the trace, it appears that stack handling is messed up and
eventually it causes the crash. This can happen when
PetscFunctionBegin is not matched up with PetscFunctionReturn. Can you try
running this with

  -checkstack
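
For reference, a minimal sketch of the pairing I mean (the helper function is made
up; the point is that every PetscFunctionBegin is popped by a matching
PetscFunctionReturn on every exit path):

#include <petscvec.h>

/* Hypothetical helper, only to illustrate the begin/return pairing: every exit
   path goes through PetscFunctionReturn(), so the debug stack pushed by
   PetscFunctionBegin is popped again. A bare "return" would leave it unbalanced. */
static PetscErrorCode MyVecCheck(Vec v, PetscBool *nonempty)
{
  PetscInt n;

  PetscFunctionBegin;
  PetscCall(VecGetLocalSize(v, &n));
  *nonempty = (PetscBool)(n > 0);
  PetscFunctionReturn(0);
}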

  Thanks,

 Matt


> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io
>
> --
> *From:* Sajid Ali Syed 
> *Sent:* Wednesday, February 22, 2023 5:11 PM
> *To:* Barry Smith 
> *Cc:* petsc-users@mcs.anl.gov 
> *Subject:* Re: [petsc-users] KSP_Solve crashes in debug mode
>
> Hi Barry,
>
> Thanks a lot for fixing this issue. I ran the same problem on a linux
> machine and have the following trace for the same crash (with ASAN turned
> on for both PETSc (on the latest commit of the branch) and the application)
> : https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940
>
> The trace seems to indicate a couple of buffer overflows, one of which
> causes the crash. I'm not sure as to what causes them.
>
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io
>
> --
> *From:* Barry Smith 
> *Sent:* Wednesday, February 15, 2023 2:01 PM
> *To:* Sajid Ali Syed 
> *Cc:* petsc-users@mcs.anl.gov 
> *Subject:* Re: [petsc-users] KSP_Solve crashes in debug mode
>
>
> https://gitlab.com/petsc/petsc/-/merge_requests/6075
> 
>  should
> fix the possible recursive error condition Matt pointed out
>
>
> On Feb 9, 2023, at 6:24 PM, Matthew Knepley  wrote:
>
> On Thu, Feb 9, 2023 at 6:05 PM Sajid Ali Syed via petsc-users <
> petsc-users@mcs.anl.gov> wrote:
>
> I added “-malloc_debug” in a .petscrc file and ran it again. The backtrace
> from lldb is in the attached file. The crash now seems to be at:
>
> Process 32660 stopped* thread #1, queue = 'com.apple.main-thread', stop 
> reason = EXC_BAD_ACCESS (code=2, address=0x16f603fb8)
> frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, 
> fd=0x, format=0x) at mprint.c:601
>598   `PetscViewerASCIISynchronizedPrintf()`, 
> `PetscSynchronizedFlush()`
>599  @*/
>600  PetscErrorCode PetscFPrintf(MPI_Comm comm, FILE *fd, const char 
> format[], ...)
> -> 601  {
>602   PetscMPIInt rank;
>603
>604   PetscFunctionBegin;
> (lldb) frame info
> frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, 
> fd=0x, format=0x) at mprint.c:601
> (lldb)
>
> The trace seems to indicate some sort of infinite loop causing an overflow.
>
>
> Yes, I have also seen this. What happens is that we have a memory error.
> The error is reported inside PetscMallocValidate()
> using PetscErrorPrintf, which eventually calls PetscCallMPI, which calls
> PetscMallocValidate again, which fails. We need to
> remove all error checking from the prints inside Validate.
>
>   Thanks,
>
>  Matt
>
>
> PS: I'm using an arm64 Mac, so I don't have access to valgrind.
>
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Scientific Computing Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io
> 
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> 
>
>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener


Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-22 Thread Sajid Ali Syed via petsc-users
One thing to note in relation to the trace attached in the previous email is 
that there are no warnings until the 36th call to KSP_Solve. The first error 
(as indicated by ASAN) occurs somewhere before the 40th call to KSP_Solve (part 
of what the application marks as turn 10 of the propagator). The crash finally 
occurs on the 43rd call to KSP_solve.


Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Sajid Ali Syed 
Sent: Wednesday, February 22, 2023 5:11 PM
To: Barry Smith 
Cc: petsc-users@mcs.anl.gov 
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode

Hi Barry,

Thanks a lot for fixing this issue. I ran the same problem on a linux machine 
and have the following trace for the same crash (with ASAN turned on for both 
PETSc (on the latest commit of the branch) and the application) : 
https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940

The trace seems to indicate a couple of buffer overflows, one of which causes 
the crash. I'm not sure as to what causes them.

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Barry Smith 
Sent: Wednesday, February 15, 2023 2:01 PM
To: Sajid Ali Syed 
Cc: petsc-users@mcs.anl.gov 
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode


https://gitlab.com/petsc/petsc/-/merge_requests/6075
 should fix the possible recursive error condition Matt pointed out


On Feb 9, 2023, at 6:24 PM, Matthew Knepley  wrote:

On Thu, Feb 9, 2023 at 6:05 PM Sajid Ali Syed via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:

I added “-malloc_debug” in a .petscrc file and ran it again. The backtrace from 
lldb is in the attached file. The crash now seems to be at:

Process 32660 stopped* thread #1, queue = 'com.apple.main-thread', stop reason 
= EXC_BAD_ACCESS (code=2, address=0x16f603fb8)
frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, 
fd=0x, format=0x) at mprint.c:601
   598   `PetscViewerASCIISynchronizedPrintf()`, 
`PetscSynchronizedFlush()`
   599  @*/
   600  PetscErrorCode PetscFPrintf(MPI_Comm comm, FILE *fd, const char 
format[], ...)
-> 601  {
   602   PetscMPIInt rank;
   603  
   604   PetscFunctionBegin;
(lldb) frame info
frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, 
fd=0x, format=0x) at mprint.c:601
(lldb)


The trace seems to indicate some sort of infinite loop causing an overflow.

Yes, I have also seen this. What happens is that we have a memory error. The 
error is reported inside PetscMallocValidate()
using PetscErrorPrintf, which eventually calls PetscCallMPI, which calls 
PetscMallocValidate again, which fails. We need to
remove all error checking from the prints inside Validate.

  Thanks,

 Matt


PS: I'm using an arm64 Mac, so I don't have access to valgrind.

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/



Re: [petsc-users] KSP_Solve crashes in debug mode

2023-02-22 Thread Sajid Ali Syed via petsc-users
Hi Barry,

Thanks a lot for fixing this issue. I ran the same problem on a linux machine 
and have the following trace for the same crash (with ASAN turned on for both 
PETSc (on the latest commit of the branch) and the application) : 
https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940

The trace seems to indicate a couple of buffer overflows, one of which causes 
the crash. I'm not sure as to what causes them.

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


From: Barry Smith 
Sent: Wednesday, February 15, 2023 2:01 PM
To: Sajid Ali Syed 
Cc: petsc-users@mcs.anl.gov 
Subject: Re: [petsc-users] KSP_Solve crashes in debug mode


https://gitlab.com/petsc/petsc/-/merge_requests/6075
 should fix the possible recursive error condition Matt pointed out


On Feb 9, 2023, at 6:24 PM, Matthew Knepley  wrote:

On Thu, Feb 9, 2023 at 6:05 PM Sajid Ali Syed via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:

I added “-malloc_debug” in a .petscrc file and ran it again. The backtrace from 
lldb is in the attached file. The crash now seems to be at:

Process 32660 stopped* thread #1, queue = 'com.apple.main-thread', stop reason 
= EXC_BAD_ACCESS (code=2, address=0x16f603fb8)
frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, 
fd=0x, format=0x) at mprint.c:601
   598   `PetscViewerASCIISynchronizedPrintf()`, 
`PetscSynchronizedFlush()`
   599  @*/
   600  PetscErrorCode PetscFPrintf(MPI_Comm comm, FILE *fd, const char 
format[], ...)
-> 601  {
   602   PetscMPIInt rank;
   603  
   604   PetscFunctionBegin;
(lldb) frame info
frame #0: 0x000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, 
fd=0x, format=0x) at mprint.c:601
(lldb)


The trace seems to indicate some sort of infinite loop causing an overflow.

Yes, I have also seen this. What happens is that we have a memory error. The 
error is reported inside PetscMallocValidate()
using PetscErrorPrintf, which eventually calls PetscCallMPI, which calls 
PetscMallocValidate again, which fails. We need to
remove all error checking from the prints inside Validate.

  Thanks,

 Matt


PS: I'm using an arm64 Mac, so I don't have access to valgrind.

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/



Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Matthew Knepley
On Wed, Feb 22, 2023 at 5:45 PM Paul Grosse-Bley <
paul.grosse-b...@ziti.uni-heidelberg.de> wrote:

> I thought I had observed that the number of cycles
> with -pc_mg_multiplicative_cycles depends on rtol, but I might have
> seen this with maxits=0, which would explain my misunderstanding of
> richardson.
>
> I guess PCAMGX does not use this PCApplyRichardson_MG (yet?), because I
> still see the multiple PCApplys there with maxits=1 and richardson, while
> they are gone for PCMG (due to getting rid of KSPSetComputeInitialGuess?).
>

Yes, that is true. AMGX is a black box, so we cannot look inside the
V-cycle.

  Thanks,

 Matt


> Best,
> Paul Grosse-Bley
>
>
> On Wednesday, February 22, 2023 23:20 CET, Barry Smith 
> wrote:
>
>
>
>
>
>Preonly means exactly one application of the PC so it will never
> converge by itself unless the PC is a full solver.
>
> Note there is a PCApplyRichardson_MG() that gets used automatically
> with KSPRICHARDSON. This does not have an "extra" application of the
> preconditioner so 2 iterations of Richardson with MG will use 2
> applications of the V-cycle. So it is exactly "multigrid as a solver,
> without a Krylov method", no extra work. So I don't think you need to make
> any "compromises".
>
>   Barry
>
>
>
>
> On Feb 22, 2023, at 4:57 PM, Paul Grosse-Bley <
> paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
> Hi again,
>
> I now found out that
>
> 1. preonly ignores -ksp_pc_side right (makes sense, I guess).
> 2. richardson is incompatible with -ksp_pc_side right.
> 3. preonly gives less output for -log_view -pc_mg_log than richardson.
> 4. preonly also ignores -ksp_rtol etc..
> 5. preonly causes -log_view to measure incorrect timings for custom
> stages, i.e. the time for the stage (219us) is significantly shorter than
> the time for the KSPSolve inside the stage (~40ms).
>
> Number 4 will be problematic as I want to benchmark number of V-cycles and
> runtime for a given rtol. At the same time I want to avoid richardson now
> because of number 2 and the additional work of scaling the RHS.
>
> Is there any good way of just using MG V-cycles as a solver, i.e. without
> interference from an outer Krylov solver and still iterate until
> convergence?
> Or will I just have to accept the additional V-cycle due to the left
> application of the PC with richardson?
>
> I guess I could also manually change -pc_mg_multiplicative_cycles until
> the residual gets low enough (using preonly), but that seems very
> inefficient.
>
> Best,
> Paul Große-Bley
>
>
>
> On Wednesday, February 22, 2023 21:26 CET, "Paul Grosse-Bley" <
> paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
>
> I was using the Richardson KSP type which I guess has the same behavior as
> GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use
> preonly from now on, where maxits=1 does what I want it to do.
>
> Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I
> think all my problems are resolved for now. Thank you very much,
> Barry and Mark!
>
> Best,
> Paul Große-Bley
>
>
>
> On Wednesday, February 22, 2023 21:03 CET, Barry Smith 
> wrote:
>
>
>
>
>
>
> On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley <
> paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
> Hi Barry,
>
> I think most of my "weird" observations came from the fact that I looked
> at iterations of KSPSolve where the residual was already converged. PCMG
> and PCGAMG do one V-cycle before even taking a look at the residual and
> then independent of pc_mg_multiplicative_cycles stop if it is converged.
>
> Looking at iterations that are not converged with PCMG,
> pc_mg_multiplicative_cycles works fine.
>
> At these iterations I also see the multiple calls to PCApply in a single
> KSPSolve iteration which were throwing me off with PCAMGX before.
>
> The reason for these multiple applications of the preconditioner (tested
> for both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This
> could be better documented, I think.
>
>
>I do not understand what you are talking about with regard to maxits of
> 1 instead of 0. For KSP maxits of 1 means one iteration, 0 is kind of
> meaningless.
>
>The reason that there is a PCApply at the start of the solve is because
> by default the KSPType is KSPGMRES which by default uses left
> preconditioner which means the right hand side needs to be scaled by the
> preconditioner before the KSP process starts. So in this configuration one
> KSP iteration results in 2 PCApply.  You can use -ksp_pc_side right to use
> right preconditioning and then the number of PCApply will match the number
> of KSP iterations.
>
>
> Best,
> Paul Große-Bley
>
>
>
> On Wednesday, February 22, 2023 20:15 CET, Barry Smith 
> wrote:
>
>
>
>
>
>
> On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley <
> paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
> Hi Mark,
>
> I use Nvidia Nsight Systems with --trace
> cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

I thought I had observed that the number of cycles with 
-pc_mg_multiplicative_cycles depends on rtol, but I might have seen this with 
maxits=0, which would explain my misunderstanding of richardson.

I guess PCAMGX does not use this PCApplyRichardson_MG (yet?), because I still 
see the multiple PCApplys there with maxits=1 and richardson, while they are 
gone for PCMG (due to getting rid of KSPSetComputeInitialGuess?).

Best,
Paul Grosse-Bley


On Wednesday, February 22, 2023 23:20 CET, Barry Smith wrote:

  Preonly means exactly one application of the PC so it will never converge 
by itself unless the PC is a full solver.

  Note there is a PCApplyRichardson_MG() that gets used automatically with 
KSPRICHARDSON. This does not have an "extra" application of the preconditioner, 
so 2 iterations of Richardson with MG will use 2 applications of the V-cycle. 
So it is exactly "multigrid as a solver, without a Krylov method", no extra 
work. So I don't think you need to make any "compromises".

  Barry

On Feb 22, 2023, at 4:57 PM, Paul Grosse-Bley wrote:

Hi again,

I now found out that

1. preonly ignores -ksp_pc_side right (makes sense, I guess).
2. richardson is incompatible with -ksp_pc_side right.
3. preonly gives less output for -log_view -pc_mg_log than richardson.
4. preonly also ignores -ksp_rtol etc..
5. preonly causes -log_view to measure incorrect timings for custom stages, 
i.e. the time for the stage (219us) is significantly shorter than the time for 
the KSPSolve inside the stage (~40ms).

Number 4 will be problematic as I want to benchmark number of V-cycles and 
runtime for a given rtol. At the same time I want to avoid richardson now 
because of number 2 and the additional work of scaling the RHS.

Is there any good way of just using MG V-cycles as a solver, i.e. without 
interference from an outer Krylov solver and still iterate until convergence?
Or will I just have to accept the additional V-cycle due to the left 
application of the PC with richardson?

I guess I could also manually change -pc_mg_multiplicative_cycles until the 
residual gets low enough (using preonly), but that seems very inefficient.

Best,
Paul Große-Bley



On Wednesday, February 22, 2023 21:26 CET, "Paul Grosse-Bley" 
 wrote:
 I was using the Richardson KSP type which I guess has the same behavior as 
GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use 
preonly from now on, where maxits=1 does what I want it to do.

Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I think 
all my problems are resolved for now. Thank you very much, Barry and 
Mark!

Best,
Paul Große-Bley



On Wednesday, February 22, 2023 21:03 CET, Barry Smith wrote:

On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley wrote:

Hi Barry,

I think most of my "weird" observations came from the fact that I looked at 
iterations of KSPSolve where the residual was already converged. PCMG and 
PCGAMG do one V-cycle before even taking a look at the residual and then 
independent of pc_mg_multiplicative_cycles stop if it is converged.

Looking at iterations that are not converged with PCMG, 
pc_mg_multiplicative_cycles works fine.

At these iterations I also see the multiple calls to PCApply in a single 
KSPSolve iteration which were throwing me off with PCAMGX before.

The reason for these multiple applications of the preconditioner (tested for 
both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be 
better documented, I think.

  I do not understand what you are talking about with regard to maxits of 1 
instead of 0. For KSP, maxits of 1 means one iteration; 0 is kind of meaningless.

  The reason that there is a PCApply at the start of the solve is that by 
default the KSPType is KSPGMRES, which by default uses left preconditioning, 
which means the right-hand side needs to be scaled by the preconditioner before 
the KSP process starts. So in this configuration one KSP iteration results in 
2 PCApply. You can use -ksp_pc_side right to use right preconditioning, and 
then the number of PCApply will match the number of KSP iterations.
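
A minimal code sketch of that last suggestion, equivalent to -ksp_pc_side right
(the wrapper function is only for illustration):

#include <petscksp.h>

/* Sketch: with right preconditioning the right-hand side is not pre-scaled by
   the PC, so the PCApply count matches the KSP iteration count. */
static PetscErrorCode UseRightPreconditioning(KSP ksp)
{
  PetscFunctionBegin;
  PetscCall(KSPSetType(ksp, KSPGMRES));
  PetscCall(KSPSetPCSide(ksp, PC_RIGHT)); /* same effect as -ksp_pc_side right */
  PetscFunctionReturn(0);
}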
Best,
Paul Große-Bley



On Wednesday, February 22, 2023 20:15 CET, Barry Smith wrote:

On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley wrote:

Hi Mark,

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.

The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occupied which is also visible in Nsight Systems.   Hmm, I run an example with 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Barry Smith


   Preonly means exactly one application of the PC so it will never converge by 
itself unless the PC is a full solver.

   Note there is a PCApplyRichardson_MG() that gets used automatically with 
KSPRICHARDSON. This does not have an "extra" application of the preconditioner so 
2 iterations of Richardson with MG will use 2 applications of the V-cycle. So 
it is exactly "multigrid as a solver, without a Krylov method", no extra work. 
So I don't think you need to make any "compromises". 

  Barry
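
   In code that setup is just the following (the wrapper function is only for 
illustration; everything in it is standard KSP/PC calls):

#include <petscksp.h>

/* Sketch: multigrid "as a solver" by wrapping PCMG in KSPRICHARDSON, so each
   Richardson iteration is one V-cycle via PCApplyRichardson_MG() and the usual
   tolerances (rtol etc.) still control convergence. */
static PetscErrorCode UseMGAsSolver(KSP ksp, PetscReal rtol)
{
  PC pc;

  PetscFunctionBegin;
  PetscCall(KSPSetType(ksp, KSPRICHARDSON));
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCMG));
  PetscCall(KSPSetTolerances(ksp, rtol, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT));
  PetscFunctionReturn(0);
}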



> On Feb 22, 2023, at 4:57 PM, Paul Grosse-Bley 
>  wrote:
> 
> Hi again,
> 
> I now found out that
> 
> 1. preonly ignores -ksp_pc_side right (makes sense, I guess).
> 2. richardson is incompatible with -ksp_pc_side right.
> 3. preonly gives less output for -log_view -pc_mg_log than richardson.
> 4. preonly also ignores -ksp_rtol etc..
> 5. preonly causes -log_view to measure incorrect timings for custom stages, 
> i.e. the time for the stage (219us) is significantly shorter than the time 
> for the KSPSolve inside the stage (~40ms).
> 
> Number 4 will be problematic as I want to benchmark number of V-cycles and 
> runtime for a given rtol. At the same time I want to avoid richardson now 
> because of number 2 and the additional work of scaling the RHS.
> 
> Is there any good way of just using MG V-cycles as a solver, i.e. without 
> interference from an outer Krylov solver and still iterate until convergence?
> Or will I just have to accept the additional V-cycle due to the left 
> application of the PC with richardson?
> 
> I guess I could also manually change -pc_mg_multiplicative_cycles until the 
> residual gets low enough (using preonly), but that seems very inefficient.
> 
> Best,
> Paul Große-Bley
> 
> 
> 
> On Wednesday, February 22, 2023 21:26 CET, "Paul Grosse-Bley" 
>  wrote:
>  
>> 
>> I was using the Richardson KSP type which I guess has the same behavior as 
>> GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use 
>> preonly from now on, where maxits=1 does what I want it to do.
>> 
>> Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I 
>> think all my problems are resolved for now. Thank you very much, 
>> Barry and Mark!
>> 
>> Best,
>> Paul Große-Bley
>> 
>> 
>> 
>> On Wednesday, February 22, 2023 21:03 CET, Barry Smith  
>> wrote:
>>  
>>> 
>>>  
>>  
>>> 
>>> On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley 
>>>  wrote:
>>>  
>>> Hi Barry,
>>> 
>>> I think most of my "weird" observations came from the fact that I looked at 
>>> iterations of KSPSolve where the residual was already converged. PCMG and 
>>> PCGAMG do one V-cycle before even taking a look at the residual and then 
>>> independent of pc_mg_multiplicative_cycles stop if it is converged.
>>> 
>>> Looking at iterations that are not converged with PCMG, 
>>> pc_mg_multiplicative_cycles works fine.
>>> 
>>> At these iterations I also see the multiple calls to PCApply in a single 
>>> KSPSolve iteration which were throwing me off with PCAMGX before.
>>> 
>>> The reason for these multiple applications of the preconditioner (tested 
>>> for both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This 
>>> could be better documented, I think.
>>  
>>I do not understand what you are talking about with regard to maxits of 1 
>> instead of 0. For KSP maxits of 1 means one iteration, 0 is kind of 
>> meaningless.
>>  
>>The reason that there is a PCApply at the start of the solve is because 
>> by default the KSPType is KSPGMRES which by default uses left preconditioner 
>> which means the right hand side needs to be scaled by the preconditioner 
>> before the KSP process starts. So in this configuration one KSP iteration 
>> results in 2 PCApply.  You can use -ksp_pc_side right to use right 
>> preconditioning and then the number of PCApply will match the number of KSP 
>> iterations.
>>> 
>>> 
>>> Best,
>>> Paul Große-Bley
>>> 
>>> 
>>> 
>>> On Wednesday, February 22, 2023 20:15 CET, Barry Smith  
>>> wrote:
>>>  
 
  
>>>  
 
 On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley 
  wrote:
  
 Hi Mark,
 
 I use Nvidia Nsight Systems with --trace 
 cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX 
 markers that come with -log_view. I.e. I get a nice view of all cuBLAS and 
 cuSPARSE calls (in addition to the actual kernels which are not always 
 easy to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even 
 more detailed NVTX markers.
 
 The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear 
 because kernel runtimes on coarser levels are much shorter. At the 
 coarsest level, there normally isn't even enough work for the GPU (Nvidia 
 A100) to be fully occupied which is also visible in Nsight Systems.
>>>  
>>>   Hmm, I run an example with -pc_mg_multiplicative_cycles 2 and most 
>>> definitely it changes the run. I am not understanding why it would not 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Matthew Knepley
On Wed, Feb 22, 2023 at 4:57 PM Paul Grosse-Bley <
paul.grosse-b...@ziti.uni-heidelberg.de> wrote:

> Hi again,
>
> I now found out that
>
> 1. preonly ignores -ksp_pc_side right (makes sense, I guess).
> 2. richardson is incompatible with -ksp_pc_side right.
> 3. preonly gives less output for -log_view -pc_mg_log than richardson.
> 4. preonly also ignores -ksp_rtol etc..
> 5. preonly causes -log_view to measure incorrect timings for custom
> stages, i.e. the time for the stage (219us) is significantly shorter than
> the time for the KSPSolve inside the stage (~40ms).
>

I think there is a misunderstanding about KSPPREONLY. This applies the
preconditioner once and does nothing else. That is why it ignores the
tolerances, and viewers, etc. This is normally used with an exact
factorization like LU to remove any Krylov overhead.

If you want several iterations of the preconditioner, then you want
Richardson. This is just steepest descent on the preconditioned
operator. In this case, the initial application of a V-cycle is not "extra"
in that you use that residual as the descent direction, and it
is the one you want.

  Thanks,

Matt
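
For example, the typical KSPPREONLY pattern is a single exact solve (a sketch;
the wrapper function is only for illustration):

#include <petscksp.h>

/* Sketch: KSPPREONLY applies the preconditioner exactly once, so it only makes
   sense when the PC is itself a complete solver, e.g. a direct LU factorization. */
static PetscErrorCode UseDirectSolve(KSP ksp)
{
  PC pc;

  PetscFunctionBegin;
  PetscCall(KSPSetType(ksp, KSPPREONLY));
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCLU));
  PetscFunctionReturn(0);
}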


> Number 4 will be problematic as I want to benchmark number of V-cycles and
> runtime for a given rtol. At the same time I want to avoid richardson now
> because of number 2 and the additional work of scaling the RHS.
>
> Is there any good way of just using MG V-cycles as a solver, i.e. without
> interference from an outer Krylov solver and still iterate until
> convergence?
> Or will I just have to accept the additional V-cycle due to the left
> application of the PC with richardson?
>
> I guess I could also manually change -pc_mg_multiplicative_cycles until
> the residual gets low enough (using preonly), but that seems very
> inefficient.
>
> Best,
> Paul Große-Bley
>
>
>
> On Wednesday, February 22, 2023 21:26 CET, "Paul Grosse-Bley" <
> paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
>
> I was using the Richardson KSP type which I guess has the same behavior as
> GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use
> preonly from now on, where maxits=1 does what I want it to do.
>
> Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I
> think all my problems are resolved for now. Thank you very much,
> Barry and Mark!
>
> Best,
> Paul Große-Bley
>
>
>
> On Wednesday, February 22, 2023 21:03 CET, Barry Smith 
> wrote:
>
>
>
>
>
>
> On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley <
> paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
> Hi Barry,
>
> I think most of my "weird" observations came from the fact that I looked
> at iterations of KSPSolve where the residual was already converged. PCMG
> and PCGAMG do one V-cycle before even taking a look at the residual and
> then independent of pc_mg_multiplicative_cycles stop if it is converged.
>
> Looking at iterations that are not converged with PCMG,
> pc_mg_multiplicative_cycles works fine.
>
> At these iterations I also see the multiple calls to PCApply in a single
> KSPSolve iteration which were throwing me off with PCAMGX before.
>
> The reason for these multiple applications of the preconditioner (tested
> for both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This
> could be better documented, I think.
>
>
>I do not understand what you are talking about with regard to maxits of
> 1 instead of 0. For KSP maxits of 1 means one iteration, 0 is kind of
> meaningless.
>
>The reason that there is a PCApply at the start of the solve is because
> by default the KSPType is KSPGMRES which by default uses left
> preconditioner which means the right hand side needs to be scaled by the
> preconditioner before the KSP process starts. So in this configuration one
> KSP iteration results in 2 PCApply.  You can use -ksp_pc_side right to use
> right preconditioning and then the number of PCApply will match the number
> of KSP iterations.
>
>
> Best,
> Paul Große-Bley
>
>
>
> On Wednesday, February 22, 2023 20:15 CET, Barry Smith 
> wrote:
>
>
>
>
>
>
> On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley <
> paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
> Hi Mark,
>
> I use Nvidia Nsight Systems with --trace
> cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX
> markers that come with -log_view. I.e. I get a nice view of all cuBLAS and
> cuSPARSE calls (in addition to the actual kernels which are not always easy
> to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more
> detailed NVTX markers.
>
> The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear
> because kernel runtimes on coarser levels are much shorter. At the coarsest
> level, there normally isn't even enough work for the GPU (Nvidia A100) to
> be fully occupied which is also visible in Nsight Systems.
>
>
>   Hmm, I run an example with -pc_mg_multiplicative_cycles 2 and most
> definitely it changes the run. I am not understanding why it would not work
> 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

Hi again,

I now found out that

1. preonly ignores -ksp_pc_side right (makes sense, I guess).
2. richardson is incompatible with -ksp_pc_side right.
3. preonly gives less output for -log_view -pc_mg_log than richardson.
4. preonly also ignores -ksp_rtol etc..
5. preonly causes -log_view to measure incorrect timings for custom stages, 
i.e. the time for the stage (219us) is significantly shorter than the time for 
the KSPSolve inside the stage (~40ms).

Number 4 will be problematic as I want to benchmark number of V-cycles and 
runtime for a given rtol. At the same time I want to avoid richardson now 
because of number 2 and the additional work of scaling the RHS.

Is there any good way of just using MG V-cycles as a solver, i.e. without 
interference from an outer Krylov solver and still iterate until convergence?
Or will I just have to accept the additional V-cycle due to the left 
application of the PC with richardson?

I guess I could also manually change -pc_mg_multiplicative_cycles until the 
residual gets low enough (using preonly), but that seems very inefficient.

Best,
Paul Große-Bley



On Wednesday, February 22, 2023 21:26 CET, "Paul Grosse-Bley" 
 wrote:
 I was using the Richardson KSP type which I guess has the same behavior as 
GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use 
preonly from now on, where maxits=1 does what I want it to do.

Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I think 
all my problems are resolved for now. Thank you very much, Barry and 
Mark!

Best,
Paul Große-Bley



On Wednesday, February 22, 2023 21:03 CET, Barry Smith  wrote:
   On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley 
 wrote: Hi Barry,

I think most of my "weird" observations came from the fact that I looked at 
iterations of KSPSolve where the residual was already converged. PCMG and 
PCGAMG do one V-cycle before even taking a look at the residual and then 
independent of pc_mg_multiplicative_cycles stop if it is converged.

Looking at iterations that are not converged with PCMG, 
pc_mg_multiplicative_cycles works fine.

At these iterations I also see the multiple calls to PCApply in a single 
KSPSolve iteration which were throwing me off with PCAMGX before.

The reason for these multiple applications of the preconditioner (tested for 
both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be 
better documented, I think.    I do not understand what you are talking about 
with regard to maxits of 1 instead of 0. For KSP maxits of 1 means one 
iteration, 0 is kind of meaningless.    The reason that there is a PCApply at 
the start of the solve is because by default the KSPType is KSPGMRES which by 
default uses left preconditioner which means the right hand side needs to be 
scaled by the preconditioner before the KSP process starts. So in this 
configuration one KSP iteration results in 2 PCApply.  You can use -ksp_pc_side 
right to use right preconditioning and then the number of PCApply will match 
the number of KSP iterations.
Best,
Paul Große-Bley



On Wednesday, February 22, 2023 20:15 CET, Barry Smith  wrote:
   On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley 
 wrote: Hi Mark,

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.

The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occupied which is also visible in Nsight Systems.   Hmm, I run an example with 
-pc_mg_multiplicative_cycles 2 and most definitely it changes the run. I am not 
understanding why it would not work for you. If you use and don't use the 
option are the exact same counts listed for all the events in the -log_view ? 
I run only a single MPI rank with a single GPU, so profiling is straighforward.

Best,
Paul Große-Bley

On Wednesday, February 22, 2023 18:24 CET, Mark Adams  wrote:
   On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
 wrote:Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).

The profiling results for PCMG and PCAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make 
any 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

I was using the Richardson KSP type which I guess has the same behavior as 
GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use 
preonly from now on, where maxits=1 does what I want it to do.

Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I think 
all my problems are resolved for now. Thank you very much, Barry and 
Mark!

Best,
Paul Große-Bley



On Wednesday, February 22, 2023 21:03 CET, Barry Smith  wrote:
   On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley 
 wrote: Hi Barry,

I think most of my "weird" observations came from the fact that I looked at 
iterations of KSPSolve where the residual was already converged. PCMG and 
PCGAMG do one V-cycle before even taking a look at the residual and then 
independent of pc_mg_multiplicative_cycles stop if it is converged.

Looking at iterations that are not converged with PCMG, 
pc_mg_multiplicative_cycles works fine.

At these iterations I also see the multiple calls to PCApply in a single 
KSPSolve iteration which were throwing me off with PCAMGX before.

The reason for these multiple applications of the preconditioner (tested for 
both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be 
better documented, I think.    I do not understand what you are talking about 
with regard to maxits of 1 instead of 0. For KSP maxits of 1 means one 
iteration, 0 is kind of meaningless.    The reason that there is a PCApply at 
the start of the solve is because by default the KSPType is KSPGMRES which by 
default uses left preconditioner which means the right hand side needs to be 
scaled by the preconditioner before the KSP process starts. So in this 
configuration one KSP iteration results in 2 PCApply.  You can use -ksp_pc_side 
right to use right preconditioning and then the number of PCApply will match 
the number of KSP iterations.
Best,
Paul Große-Bley



On Wednesday, February 22, 2023 20:15 CET, Barry Smith  wrote:
   On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley 
 wrote: Hi Mark,

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.

The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occupied which is also visible in Nsight Systems.   Hmm, I run an example with 
-pc_mg_multiplicative_cycles 2 and most definitely it changes the run. I am not 
understanding why it would not work for you. If you use and don't use the 
option are the exact same counts listed for all the events in the -log_view ? 
I run only a single MPI rank with a single GPU, so profiling is straightforward.

Best,
Paul Große-Bley

On Wednesday, February 22, 2023 18:24 CET, Mark Adams  wrote:
   On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
 wrote:Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).

The profiling results for PCMG and PCGAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make 
any difference. I would have expected to be able to increase the number of 
V-cycles per KSP iteration, but I keep seeing just a single V-cycle when 
changing the option (using PCMG). How are you seeing this? You might try 
-log_trace to see if you get two V cycles. 
When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between 
bench iterations to reset the solution vector does not seem to work as the 
residual keeps shrinking. Is this a bug? Any advice for working around this?
  Looking at the doc 
https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you 
use this with  KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the 
solution vector.  The profile for BoomerAMG also doesn't really show the 
V-cycle behavior of the other implementations. Most of the runtime seems to go 
into calls to cusparseDcsrsv which might happen at the different levels, but 
the runtime of these kernels doesn't show the V-cycle pattern. According to the 
output with -pc_hypre_boomeramg_print_statistics it is doing the right thing 
though, so I guess it is alright (and if not, this is probably the wrong place 
to discuss it).
When using PCAMGX, I see 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

Hi Barry,

the picture keeps getting clearer. I did not use KSPSetInitialGuessNonzero or 
the corresponding option, but using KSPSetComputeInitialGuess probably sets it 
automatically (without telling one in the output of -help).

I was also confused by the preonly KSP type not working which is also caused by 
this. I think ex45 should be changed to use a non-zero initial guess (plus 
maybe a comment mentioning that one should avoid KSPSetComputeInitialGuess when 
using the zero initial guess).

Thank you for asking the right questions,
Paul Große-Bley



On Wednesday, February 22, 2023 20:46 CET, Barry Smith  wrote:
   On Feb 22, 2023, at 2:19 PM, Paul Grosse-Bley 
 wrote: Hi again,

after checking with -ksp_monitor for PCMG, it seems my assumption that I could 
reset the solution by calling KSPSetComputeInitialGuess and then KSPSetup was 
generally wrong and BoomerAMG was just the only preconditioner that cleverly 
stops doing work when the residual is already converged (which then caused me 
to find the shrinking residual and to think that it was doing something 
different to the other methods).    Depending on your example the 
KSPSetComputeInitialGuess usage might not work. I suggest not using it for your 
benchmarking.    But if you just call KSPSolve() multiple times it will zero 
the solution at the beginning of each KSPSolve() (unless you use 
KSPSetInitialGuessNonzero()). So if you want to run multiple solves for testing 
you should not need to do anything. This will be true for any use of KSPSolve() 
whether the PC is from PETSc, hypre or NVIDIA.   
So, how can I get KSP to use the function given through 
KSPSetComputeInitialGuess to reset the solution vector (without calling 
KSPReset which would add a lot of overhead, I assume)?

Best,
Paul Große-Bley


On Wednesday, February 22, 2023 19:46 CET, Mark Adams  wrote:
 OK, Nsight Systems is a good way to see what is going on. So all three of your 
solvers are not traversing the MG hierarchy with the correct logic. I don't know 
about hypre, but PCMG and AMGx are pretty simple and AMGx dives into the AMGx 
library directly from our interface. Some things to try:
* Use -options_left to make sure your options are being used (eg, spelling mistakes)
* Use -ksp_view to see a human-readable list of your solver parameters.
* Use -log_trace to see if the correct methods are called.
 - PCMG calls PCMGMCycle_Private for each of the cycles in code like:
   for (i = 0; i < mg->cyclesperpcapply; i++) PetscCall(PCMGMCycle_Private(pc, mglevels + levels - 1, transpose, matapp, NULL));
 - AMGx is called via PCApply_AMGX and then it dives into the library. See where these three calls to AMGx are called from.

Mark

On Wed, Feb 22, 2023 at 1:10 PM Paul Grosse-Bley  wrote:
Hi Mark,

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.

The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occupied which is also visible in Nsight Systems.

I run only a single MPI rank with a single GPU, so profiling is straightforward.

Best,
Paul Große-Bley

On Wednesday, February 22, 2023 18:24 CET, Mark Adams  wrote:
   On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
 wrote:Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).

The profiling results for PCMG and PCGAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make 
any difference. I would have expected to be able to increase the number of 
V-cycles per KSP iteration, but I keep seeing just a single V-cycle when 
changing the option (using PCMG). How are you seeing this? You might try 
-log_trace to see if you get two V cycles. 
When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between 
bench iterations to reset the solution vector does not seem to work as the 
residual keeps shrinking. Is this a bug? Any advice for working around this?
  Looking at the doc 
https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you 
use this with  KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the 
solution vector.  The profile for BoomerAMG also 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Barry Smith


> On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley 
>  wrote:
> 
> Hi Barry,
> 
> I think most of my "weird" observations came from the fact that I looked at 
> iterations of KSPSolve where the residual was already converged. PCMG and 
> PCGAMG do one V-cycle before even taking a look at the residual and then 
> independent of pc_mg_multiplicative_cycles stop if it is converged.
> 
> Looking at iterations that are not converged with PCMG, 
> pc_mg_multiplicative_cycles works fine.
> 
> At these iterations I also see the multiple calls to PCApply in a single 
> KSPSolve iteration which were throwing me off with PCAMGX before.
> 
> The reason for these multiple applications of the preconditioner (tested for 
> both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could 
> be better documented, I think.

   I do not understand what you are talking about with regard to maxits of 1 
instead of 0. For KSP maxits of 1 means one iteration, 0 is kind of meaningless.

   The reason that there is a PCApply at the start of the solve is because by 
default the KSPType is KSPGMRES which by default uses left preconditioner which 
means the right hand side needs to be scaled by the preconditioner before the 
KSP process starts. So in this configuration one KSP iteration results in 2 
PCApply.  You can use -ksp_pc_side right to use right preconditioning and then 
the number of PCApply will match the number of KSP iterations.
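
For completeness, the same thing set in code rather than on the command line (a
sketch, assuming an existing KSP named ksp):

  PetscCall(KSPSetType(ksp, KSPGMRES));
  PetscCall(KSPSetPCSide(ksp, PC_RIGHT));   /* equivalent to -ksp_pc_side right */
  /* With PC_RIGHT, each GMRES iteration triggers exactly one PCApply; with the
     default PC_LEFT the right-hand side is preconditioned once up front as well. */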
> 
> Best,
> Paul Große-Bley
> 
> 
> 
> On Wednesday, February 22, 2023 20:15 CET, Barry Smith  
> wrote:
>  
>> 
>>  
>  
>> 
>> On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley 
>>  wrote:
>>  
>> Hi Mark,
>> 
>> I use Nvidia Nsight Systems with --trace 
>> cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX 
>> markers that come with -log_view. I.e. I get a nice view of all cuBLAS and 
>> cuSPARSE calls (in addition to the actual kernels which are not always easy 
>> to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more 
>> detailed NVTX markers.
>> 
>> The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear 
>> because kernel runtimes on coarser levels are much shorter. At the coarsest 
>> level, there normally isn't even enough work for the GPU (Nvidia A100) to be 
>> fully occupied which is also visible in Nsight Systems.
>  
>   Hmm, I run an example with -pc_mg_multiplicative_cycles 2 and most 
> definitely it changes the run. I am not understanding why it would not work 
> for you. If you use and don't use the option are the exact same counts listed 
> for all the events in the -log_view ? 
>> 
>> 
>> I run only a single MPI rank with a single GPU, so profiling is 
>> straightforward.
>> 
>> Best,
>> Paul Große-Bley
>> 
>> On Wednesday, February 22, 2023 18:24 CET, Mark Adams  
>> wrote:
>>  
>>> 
>>>  
>>>  
>>> On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
>>> >> > wrote:
 Hi Barry,
 
 after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
 called as part of KSPSolve instead of KSPSetup, but its runtime is way 
 less significant than the cudaMemcpy before, so I guess I will leave it 
 like this. Other than that I kept the code like in my first message in 
 this thread (as you wrote, benchmark_ksp.c is not well suited for PCMG).
 
 The profiling results for PCMG and PCAMG look as I would expect them to 
 look, i.e. one can nicely see the GPU load/kernel runtimes going down and 
 up again for each V-cycle.
 
 I was wondering about -pc_mg_multiplicative_cycles as it does not seem to 
 make any difference. I would have expected to be able to increase the 
 number of V-cycles per KSP iteration, but I keep seeing just a single 
 V-cycle when changing the option (using PCMG).
>>>  
>>> How are you seeing this? 
>>> You might try -log_trace to see if you get two V cycles.
>>>  
 
 When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess 
 between bench iterations to reset the solution vector does not seem to 
 work as the residual keeps shrinking. Is this a bug? Any advice for 
 working around this?
  
>>>  
>>> Looking at the doc 
>>> https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ 
>>> you use this with  KSPSetComputeRHS.
>>>  
>>> In src/snes/tests/ex13.c I just zero out the solution vector.
>>>   
 The profile for BoomerAMG also doesn't really show the V-cycle behavior of 
 the other implementations. Most of the runtime seems to go into calls to 
 cusparseDcsrsv which might happen at the different levels, but the runtime 
 of these kernels doesn't show the V-cycle pattern. According to the output 
 with -pc_hypre_boomeramg_print_statistics it is doing the right thing 
 though, so I guess it is alright (and if not, this is probably the wrong 
 place to discuss it).
 
 When using PCAMGX, I see two 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

Hi Barry,

I think most of my "weird" observations came from the fact that I looked at 
iterations of KSPSolve where the residual was already converged. PCMG and 
PCGAMG do one V-cycle before even taking a look at the residual and then 
independent of pc_mg_multiplicative_cycles stop if it is converged.

Looking at iterations that are not converged with PCMG, 
pc_mg_multiplicative_cycles works fine.

At these iterations I also see the multiple calls to PCApply in a single 
KSPSolve iteration which were throwing me off with PCAMGX before.

The reason for these multiple applications of the preconditioner (tested for 
both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be 
better documented, I think.

Best,
Paul Große-Bley



On Wednesday, February 22, 2023 20:15 CET, Barry Smith  wrote:
   On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley 
 wrote: Hi Mark,

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.

The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occupied which is also visible in Nsight Systems.   Hmm, I run an example with 
-pc_mg_multiplicative_cycles 2 and most definitely it changes the run. I am not 
understanding why it would not work for you. If you use and don't use the 
option are the exact same counts listed for all the events in the -log_view ? 
I run only a single MPI rank with a single GPU, so profiling is straightforward.

Best,
Paul Große-Bley

On Wednesday, February 22, 2023 18:24 CET, Mark Adams  wrote:
   On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
 wrote:Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).

The profiling results for PCMG and PCGAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make 
any difference. I would have expected to be able to increase the number of 
V-cycles per KSP iteration, but I keep seeing just a single V-cycle when 
changing the option (using PCMG). How are you seeing this? You might try 
-log_trace to see if you get two V cycles. 
When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between 
bench iterations to reset the solution vector does not seem to work as the 
residual keeps shrinking. Is this a bug? Any advice for working around this?
  Looking at the doc 
https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you 
use this with  KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the 
solution vector.  The profile for BoomerAMG also doesn't really show the 
V-cycle behavior of the other implementations. Most of the runtime seems to go 
into calls to cusparseDcsrsv which might happen at the different levels, but 
the runtime of these kernels doesn't show the V-cycle pattern. According to the 
output with -pc_hypre_boomeramg_print_statistics it is doing the right thing 
though, so I guess it is alright (and if not, this is probably the wrong place 
to discuss it).
When using PCAMGX, I see two PCApply (each showing a nice V-cycle behavior) 
calls in KSPSolve (three for the very first KSPSolve) while expecting just one. 
Each KSPSolve should do a single preconditioned Richardson iteration. Why is 
the preconditioner applied multiple times here?
  Again, not sure what "see" is, but PCAMGX is pretty new and has not been used 
much.Note some KSP methods apply to the PC before the iteration. Mark  Thank 
you,
Paul Große-Bley


On Monday, February 06, 2023 20:05 CET, Barry Smith  wrote:
It should not crash, take a look at the test cases at the bottom of the 
file. You are likely correct if the code, unfortunately, does use 
DMCreateMatrix() it will not work out of the box for geometric multigrid. So it 
might be the wrong example for you.   I don't know what you mean about clever. 
If you simply set the solution to zero at the beginning of the loop then it 
will just do the same solve multiple times. The setup should not do much of 
anything after the first solver.  Though usually solves are big enough that 
one need not run solves multiple times to get a good understanding of their 
performance.   On Feb 6, 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Barry Smith


> On Feb 22, 2023, at 2:19 PM, Paul Grosse-Bley 
>  wrote:
> 
> Hi again,
> 
> after checking with -ksp_monitor for PCMG, it seems my assumption that I 
> could reset the solution by calling KSPSetComputeInitialGuess and then 
> KSPSetup was generally wrong and BoomerAMG was just the only preconditioner 
> that cleverly stops doing work when the residual is already converged (which 
> then caused me to find the shrinking residual and to think that it was doing 
> something different to the other methods).

   Depending on your example the KSPSetComputeInitialGuess usage might not 
work. I suggest not using it for your benchmarking.

   But if you just call KSPSolve() multiple times it will zero the solution at 
the beginning of each KSPSolve() (unless you use KSPSetInitialGuessNonzero()). 
So if you want to run multiple solves for testing you should not need to do 
anything. This will be true for any use of KSPSolve() whether the PC is from 
PETSc, hypre or NVIDIA.
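
So a benchmarking loop can be as simple as the sketch below (ksp, b, x and the
iteration counts are placeholders for your own objects):

  for (PetscInt i = 0; i < nwarmup + nbench; i++) {
    /* KSPSolve() zeroes x at the start of each call unless
       KSPSetInitialGuessNonzero() has been set, so no manual reset is needed. */
    PetscCall(KSPSolve(ksp, b, x));
  }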

  
> 
> So, how can I get KSP to use the function given through 
> KSPSetComputeInitialGuess to reset the solution vector (without calling 
> KSPReset which would add a lot of overhead, I assume)?
> 
> Best,
> Paul Große-Bley
> 
> 
> On Wednesday, February 22, 2023 19:46 CET, Mark Adams  wrote:
>  
>> 
>> OK, Nsight Systems is a good way to see what is going on.
>>  
>> So all three of your solvers are not traversing the MG hierarchy with the 
>> correct logic.
>> I don't know about hypre but PCMG and AMGx are pretty simple and AMGx dives 
>> into the AMGx library directly from our interface.
>> Some things to try:
>> * Use -options_left to make sure your options are being used (eg, spelling 
>> mistakes)
>> * Use -ksp_view to see a human readable list of your solver parameters.
>> * Use -log_trace to see if the correct methods are called.
>>  - PCMG calls PCMGMCycle_Private for each of the cycles in code like:
>>for (i = 0; i < mg->cyclesperpcapply; i++) 
>> PetscCall(PCMGMCycle_Private(pc, mglevels + levels - 1, transpose, matapp, 
>> NULL));
>> - AMGx is called PCApply_AMGX and then it dives into the library. See where 
>> these three calls to AMGx are called from.
>>  
>> Mark
>>  
>> On Wed, Feb 22, 2023 at 1:10 PM Paul Grosse-Bley 
>> > > wrote:
>>> Hi Mark,
>>> 
>>> I use Nvidia Nsight Systems with --trace 
>>> cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX 
>>> markers that come with -log_view. I.e. I get a nice view of all cuBLAS and 
>>> cuSPARSE calls (in addition to the actual kernels which are not always easy 
>>> to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more 
>>> detailed NVTX markers.
>>> 
>>> The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear 
>>> because kernel runtimes on coarser levels are much shorter. At the coarsest 
>>> level, there normally isn't even enough work for the GPU (Nvidia A100) to 
>>> be fully occupied which is also visible in Nsight Systems.
>>> 
>>> I run only a single MPI rank with a single GPU, so profiling is 
>>> straightforward.
>>> 
>>> Best,
>>> Paul Große-Bley
>>> 
>>> On Wednesday, February 22, 2023 18:24 CET, Mark Adams >> > wrote:
>>>  
 
  
  
 On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
 >>> > wrote:
> Hi Barry,
> 
> after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
> called as part of KSPSolve instead of KSPSetup, but its runtime is way 
> less significant than the cudaMemcpy before, so I guess I will leave it 
> like this. Other than that I kept the code like in my first message in 
> this thread (as you wrote, benchmark_ksp.c is not well suited for PCMG).
> 
> The profiling results for PCMG and PCAMG look as I would expect them to 
> look, i.e. one can nicely see the GPU load/kernel runtimes going down and 
> up again for each V-cycle.
> 
> I was wondering about -pc_mg_multiplicative_cycles as it does not seem to 
> make any difference. I would have expected to be able to increase the 
> number of V-cycles per KSP iteration, but I keep seeing just a single 
> V-cycle when changing the option (using PCMG).
  
 How are you seeing this? 
 You might try -log_trace to see if you get two V cycles.
  
> 
> When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess 
> between bench iterations to reset the solution vector does not seem to 
> work as the residual keeps shrinking. Is this a bug? Any advice for 
> working around this?
>  
  
 Looking at the doc 
 https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ 
 you use this with  KSPSetComputeRHS.
  
 In src/snes/tests/ex13.c I just zero out the solution vector.
   
> The profile for BoomerAMG also doesn't really show the V-cycle behavior 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

Hi again,

after checking with -ksp_monitor for PCMG, it seems my assumption that I could 
reset the solution by calling KSPSetComputeInitialGuess and then KSPSetup was 
generally wrong and BoomerAMG was just the only preconditioner that cleverly 
stops doing work when the residual is already converged (which then caused me 
to find the shrinking residual and to think that it was doing something 
different to the other methods).

So, how can I get KSP to use the function given through 
KSPSetComputeInitialGuess to reset the solution vector (without calling 
KSPReset which would add a lot of overhead, I assume)?

Best,
Paul Große-Bley


On Wednesday, February 22, 2023 19:46 CET, Mark Adams  wrote:
 OK, Nsight Systems is a good way to see what is going on. So all three of your 
solvers are not traversing the MG hierarchy with the correct logic. I don't know 
about hypre, but PCMG and AMGx are pretty simple and AMGx dives into the AMGx 
library directly from our interface. Some things to try:
* Use -options_left to make sure your options are being used (eg, spelling mistakes)
* Use -ksp_view to see a human-readable list of your solver parameters.
* Use -log_trace to see if the correct methods are called.
 - PCMG calls PCMGMCycle_Private for each of the cycles in code like:
   for (i = 0; i < mg->cyclesperpcapply; i++) PetscCall(PCMGMCycle_Private(pc, mglevels + levels - 1, transpose, matapp, NULL));
 - AMGx is called via PCApply_AMGX and then it dives into the library. See where these three calls to AMGx are called from.

Mark

On Wed, Feb 22, 2023 at 1:10 PM Paul Grosse-Bley  wrote:
Hi Mark,

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.

The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occupied which is also visible in Nsight Systems.

I run only a single MPI rank with a single GPU, so profiling is straightforward.

Best,
Paul Große-Bley

On Wednesday, February 22, 2023 18:24 CET, Mark Adams  wrote:
   On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
 wrote:Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).

The profiling results for PCMG and PCGAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make 
any difference. I would have expected to be able to increase the number of 
V-cycles per KSP iteration, but I keep seeing just a single V-cycle when 
changing the option (using PCMG). How are you seeing this? You might try 
-log_trace to see if you get two V cycles. 
When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between 
bench iterations to reset the solution vector does not seem to work as the 
residual keeps shrinking. Is this a bug? Any advice for working around this?
  Looking at the doc 
https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you 
use this with  KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the 
solution vector.  The profile for BoomerAMG also doesn't really show the 
V-cycle behavior of the other implementations. Most of the runtime seems to go 
into calls to cusparseDcsrsv which might happen at the different levels, but 
the runtime of these kernels doesn't show the V-cycle pattern. According to the 
output with -pc_hypre_boomeramg_print_statistics it is doing the right thing 
though, so I guess it is alright (and if not, this is probably the wrong place 
to discuss it).
When using PCAMGX, I see two PCApply (each showing a nice V-cycle behavior) 
calls in KSPSolve (three for the very first KSPSolve) while expecting just one. 
Each KSPSolve should do a single preconditioned Richardson iteration. Why is 
the preconditioner applied multiple times here?
  Again, not sure what "see" is, but PCAMGX is pretty new and has not been used 
much.Note some KSP methods apply to the PC before the iteration. Mark  Thank 
you,
Paul Große-Bley


On Monday, February 06, 2023 20:05 CET, Barry Smith  wrote:
It should not crash, take a look at the test cases at the bottom of the 
file. You are likely correct if the code, unfortunately, does use 
DMCreateMatrix() it will not work out of the 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Barry Smith


> On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley 
>  wrote:
> 
> Hi Mark,
> 
> I use Nvidia Nsight Systems with --trace 
> cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
> that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
> calls (in addition to the actual kernels which are not always easy to 
> attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
> NVTX markers.
> 
> The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear 
> because kernel runtimes on coarser levels are much shorter. At the coarsest 
> level, there normally isn't even enough work for the GPU (Nvidia A100) to be 
> fully occupied which is also visible in Nsight Systems.

  Hmm, I run an example with -pc_mg_multiplicative_cycles 2 and most definitely 
it changes the run. I am not understanding why it would not work for you. If 
you use and don't use the option are the exact same counts listed for all the 
events in the -log_view ? 
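
As a cross-check, the cycle count can also be set in code; a small sketch,
assuming ksp is the solver being profiled:

  PC pc;
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCMG));
  PetscCall(PCMGMultiplicativeSetCycles(pc, 2));  /* same effect as -pc_mg_multiplicative_cycles 2 */
  /* If this is active, the per-V-cycle event counts reported by -log_view should
     roughly double compared to a run with a single cycle per PCApply. */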
> 
> 
> I run only a single MPI rank with a single GPU, so profiling is 
> straightforward.
> 
> Best,
> Paul Große-Bley
> 
> On Wednesday, February 22, 2023 18:24 CET, Mark Adams  wrote:
>  
>> 
>>  
>>  
>> On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
>> > > wrote:
>>> Hi Barry,
>>> 
>>> after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
>>> called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
>>> significant than the cudaMemcpy before, so I guess I will leave it like 
>>> this. Other than that I kept the code like in my first message in this 
>>> thread (as you wrote, benchmark_ksp.c is not well suited for PCMG).
>>> 
>>> The profiling results for PCMG and PCGAMG look as I would expect them to 
>>> look, i.e. one can nicely see the GPU load/kernel runtimes going down and 
>>> up again for each V-cycle.
>>> 
>>> I was wondering about -pc_mg_multiplicative_cycles as it does not seem to 
>>> make any difference. I would have expected to be able to increase the 
>>> number of V-cycles per KSP iteration, but I keep seeing just a single 
>>> V-cycle when changing the option (using PCMG).
>>  
>> How are you seeing this? 
>> You might try -log_trace to see if you get two V cycles.
>>  
>>> 
>>> When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess 
>>> between bench iterations to reset the solution vector does not seem to work 
>>> as the residual keeps shrinking. Is this a bug? Any advice for working 
>>> around this?
>>>  
>>  
>> Looking at the doc 
>> https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ 
>> you use this with  KSPSetComputeRHS.
>>  
>> In src/snes/tests/ex13.c I just zero out the solution vector.
>>   
>>> The profile for BoomerAMG also doesn't really show the V-cycle behavior of 
>>> the other implementations. Most of the runtime seems to go into calls to 
>>> cusparseDcsrsv which might happen at the different levels, but the runtime 
>>> of these kernels doesn't show the V-cycle pattern. According to the output 
>>> with -pc_hypre_boomeramg_print_statistics it is doing the right thing 
>>> though, so I guess it is alright (and if not, this is probably the wrong 
>>> place to discuss it).
>>> 
>>> When using PCAMGX, I see two PCApply (each showing a nice V-cycle behavior) 
>>> calls in KSPSolve (three for the very first KSPSolve) while expecting just 
>>> one. Each KSPSolve should do a single preconditioned Richardson iteration. 
>>> Why is the preconditioner applied multiple times here?
>>>  
>>  
>> Again, not sure what "see" is, but PCAMGX is pretty new and has not been 
>> used much.
>> Note some KSP methods apply to the PC before the iteration.
>>  
>> Mark 
>>  
>>> Thank you,
>>> Paul Große-Bley
>>> 
>>> 
>>> On Monday, February 06, 2023 20:05 CET, Barry Smith >> > wrote:
>>>  
 
  
>>>  
>>>  It should not crash, take a look at the test cases at the bottom of the 
>>> file. You are likely correct if the code, unfortunately, does use 
>>> DMCreateMatrix() it will not work out of the box for geometric multigrid. 
>>> So it might be the wrong example for you.
>>>  
>>>   I don't know what you mean about clever. If you simply set the solution 
>>> to zero at the beginning of the loop then it will just do the same solve 
>>> multiple times. The setup should not do much of anything after the first 
>>> solver.  Thought usually solves are big enough that one need not run solves 
>>> multiple times to get a good understanding of their performance.
>>>  
>>>  
>>>   
>>>  
>>>  
>>>  
 
 On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley 
 >>> > wrote:
  
 Hi Barry,
 
 src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting 
 point, thank you! Sadly I get a segfault when executing that example with 
 PCMG and more than one level, i.e. 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Mark Adams
OK, Nsight Systems is a good way to see what is going on.

So all three of your solvers are not traversing the MG hierarchy with the
correct logic.
I don't know about hypre but PCMG and AMGx are pretty simple and AMGx dives
into the AMGx library directly from our interface.
Some things to try:
* Use -options_left to make sure your options are being used (eg, spelling
mistakes)
* Use -ksp_view to see a human readable list of your solver parameters.
* Use -log_trace to see if the correct methods are called.
 - PCMG calls PCMGMCycle_Private for each of the cycles in code like:
   for (i = 0; i < mg->cyclesperpcapply; i++)
PetscCall(PCMGMCycle_Private(pc, mglevels + levels - 1, transpose, matapp,
NULL));
- AMGx is called PCApply_AMGX and then it dives into the library. See where
these three calls to AMGx are called from.
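
For example, an invocation along these lines (executable name and level count
are placeholders):

mpiexec -n 1 ./your_solver -ksp_type richardson -pc_type mg -pc_mg_levels 3 -pc_mg_multiplicative_cycles 2 -ksp_view -options_left -log_trace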

Mark

On Wed, Feb 22, 2023 at 1:10 PM Paul Grosse-Bley <
paul.grosse-b...@ziti.uni-heidelberg.de> wrote:

> Hi Mark,
>
> I use Nvidia Nsight Systems with --trace
> cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX
> markers that come with -log_view. I.e. I get a nice view of all cuBLAS and
> cuSPARSE calls (in addition to the actual kernels which are not always easy
> to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more
> detailed NVTX markers.
>
> The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear
> because kernel runtimes on coarser levels are much shorter. At the coarsest
> level, there normally isn't even enough work for the GPU (Nvidia A100) to
> be fully occupied which is also visible in Nsight Systems.
>
> I run only a single MPI rank with a single GPU, so profiling is
> straightforward.
>
> Best,
> Paul Große-Bley
>
> On Wednesday, February 22, 2023 18:24 CET, Mark Adams 
> wrote:
>
>
>
>
> On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley <
> paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
>> Hi Barry,
>>
>> after using VecCUDAGetArray to initialize the RHS, that kernel still gets
>> called as part of KSPSolve instead of KSPSetup, but its runtime is way less
>> significant than the cudaMemcpy before, so I guess I will leave it like
>> this. Other than that I kept the code like in my first message in this
>> thread (as you wrote, benchmark_ksp.c is not well suited for PCMG).
>>
>> The profiling results for PCMG and PCGAMG look as I would expect them to
>> look, i.e. one can nicely see the GPU load/kernel runtimes going down and
>> up again for each V-cycle.
>>
>> I was wondering about -pc_mg_multiplicative_cycles as it does not seem to
>> make any difference. I would have expected to be able to increase the
>> number of V-cycles per KSP iteration, but I keep seeing just a single
>> V-cycle when changing the option (using PCMG).
>
>
> How are you seeing this?
> You might try -log_trace to see if you get two V cycles.
>
>
>>
>> When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess
>> between bench iterations to reset the solution vector does not seem to work
>> as the residual keeps shrinking. Is this a bug? Any advice for working
>> around this?
>>
>
>
> Looking at the doc
> https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/
> you use this with  KSPSetComputeRHS.
>
> In src/snes/tests/ex13.c I just zero out the solution vector.
>
>
>> The profile for BoomerAMG also doesn't really show the V-cycle behavior
>> of the other implementations. Most of the runtime seems to go into calls to
>> cusparseDcsrsv which might happen at the different levels, but the runtime
>> of these kernels doesn't show the V-cycle pattern. According to the output
>> with -pc_hypre_boomeramg_print_statistics it is doing the right thing
>> though, so I guess it is alright (and if not, this is probably the wrong
>> place to discuss it).
>
>
>> When using PCAMGX, I see two PCApply (each showing a nice V-cycle
>> behavior) calls in KSPSolve (three for the very first KSPSolve) while
>> expecting just one. Each KSPSolve should do a single preconditioned
>> Richardson iteration. Why is the preconditioner applied multiple times here?
>>
>
>
> Again, not sure what "see" is, but PCAMGX is pretty new and has not been
> used much.
> Note some KSP methods apply to the PC before the iteration.
>
> Mark
>
>
>> Thank you,
>> Paul Große-Bley
>>
>>
>> On Monday, February 06, 2023 20:05 CET, Barry Smith 
>> wrote:
>>
>>
>>
>>
>>
>>  It should not crash, take a look at the test cases at the bottom of the
>> file. You are likely correct if the code, unfortunately, does use
>> DMCreateMatrix() it will not work out of the box for geometric multigrid.
>> So it might be the wrong example for you.
>>
>>   I don't know what you mean about clever. If you simply set the solution
>> to zero at the beginning of the loop then it will just do the same solve
>> multiple times. The setup should not do much of anything after the first
>> solver.  Though usually solves are big enough that one need not run solves
>> multiple times to get a good 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

Hi Mark,

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.

The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occupied which is also visible in Nsight Systems.

I run only a single MPI rank with a single GPU, so profiling is straightforward.

Best,
Paul Große-Bley

On Wednesday, February 22, 2023 18:24 CET, Mark Adams  wrote:
   On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
 wrote:Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).

The profiling results for PCMG and PCGAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make 
any difference. I would have expected to be able to increase the number of 
V-cycles per KSP iteration, but I keep seeing just a single V-cycle when 
changing the option (using PCMG). How are you seeing this? You might try 
-log_trace to see if you get two V cycles. 
When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between 
bench iterations to reset the solution vector does not seem to work as the 
residual keeps shrinking. Is this a bug? Any advice for working around this?
  Looking at the doc 
https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you 
use this with  KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the 
solution vector.  The profile for BoomerAMG also doesn't really show the 
V-cycle behavior of the other implementations. Most of the runtime seems to go 
into calls to cusparseDcsrsv which might happen at the different levels, but 
the runtime of these kernels doesn't show the V-cycle pattern. According to the 
output with -pc_hypre_boomeramg_print_statistics it is doing the right thing 
though, so I guess it is alright (and if not, this is probably the wrong place 
to discuss it).
When using PCAMGX, I see two PCApply (each showing a nice V-cycle behavior) 
calls in KSPSolve (three for the very first KSPSolve) while expecting just one. 
Each KSPSolve should do a single preconditioned Richardson iteration. Why is 
the preconditioner applied multiple times here?
  Again, not sure what "see" is, but PCAMGX is pretty new and has not been used 
much.Note some KSP methods apply to the PC before the iteration. Mark  Thank 
you,
Paul Große-Bley


On Monday, February 06, 2023 20:05 CET, Barry Smith  wrote:
It should not crash, take a look at the test cases at the bottom of the 
file. You are likely correct if the code, unfortunately, does use 
DMCreateMatrix() it will not work out of the box for geometric multigrid. So it 
might be the wrong example for you.   I don't know what you mean about clever. 
If you simply set the solution to zero at the beginning of the loop then it 
will just do the same solve multiple times. The setup should not do much of 
anything after the first solver.  Though usually solves are big enough that 
one need not run solves multiple times to get a good understanding of their 
performance.   On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley 
 wrote: Hi Barry,

src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, 
thank you! Sadly I get a segfault when executing that example with PCMG and 
more than one level, i.e. with the minimal args:

$ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
===
Test: KSP performance - Poisson
    Input matrix: 27-pt finite difference stencil
    -n 100
    DoFs = 100
    Number of nonzeros = 26463592

Step1  - creating Vecs and Mat...
Step2a - running PCSetUp()...
[0]PETSC ERROR: 

[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably 
memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and 
https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on 
NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: configure using 

Re: [petsc-users] Problem setting Fieldsplit fields

2023-02-22 Thread Nicolas Barnafi via petsc-users

Hi Matt,

Sorry for the late answer, it was holiday time.


Just to clarify, if you call SetIS() 3 times, and then give

  -pc_fieldsplit_0_fields 0,2

then we should reduce the number of fields to two by calling 
ISConcatenate() on the first and last ISes?


Exactly

I think this should not be hard. It will work exactly as it does on 
the DM case, except the ISes will come from
the PC, not the DM. One complication is that you will have to hold the 
new ISes until the end, and then set them.
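
For concreteness, a rough sketch of that reduction for -pc_fieldsplit_0_fields 0,2
(is0, is1, is2 stand for the three ISes given via PCFieldSplitSetIS; this is
illustrative, not the eventual implementation):

  IS parts[2] = {is0, is2}, is02;
  PetscCall(ISConcatenate(PetscObjectComm((PetscObject)pc), 2, parts, &is02));
  /* is02 now replaces fields 0 and 2 as a single split; is1 stays as the second split. */
  PetscCall(PCFieldSplitSetIS(pc, "0", is02));
  PetscCall(PCFieldSplitSetIS(pc, "1", is1));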


   Thanks,

     Matt


Nice, then it is exactly what I want. I will work on it, and create a PR 
when things are starting to fit in.


Best,
NB

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Mark Adams
On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley <
paul.grosse-b...@ziti.uni-heidelberg.de> wrote:

> Hi Barry,
>
> after using VecCUDAGetArray to initialize the RHS, that kernel still gets
> called as part of KSPSolve instead of KSPSetup, but its runtime is way less
> significant than the cudaMemcpy before, so I guess I will leave it like
> this. Other than that I kept the code like in my first message in this
> thread (as you wrote, benchmark_ksp.c is not well suited for PCMG).
>
> The profiling results for PCMG and PCGAMG look as I would expect them to
> look, i.e. one can nicely see the GPU load/kernel runtimes going down and
> up again for each V-cycle.
>
> I was wondering about -pc_mg_multiplicative_cycles as it does not seem to
> make any difference. I would have expected to be able to increase the
> number of V-cycles per KSP iteration, but I keep seeing just a single
> V-cycle when changing the option (using PCMG).
>

How are you seeing this?
You might try -log_trace to see if you get two V cycles.


>
> When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess
> between bench iterations to reset the solution vector does not seem to work
> as the residual keeps shrinking. Is this a bug? Any advice for working
> around this?
>
>
Looking at the doc
https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/
you use this with  KSPSetComputeRHS.

In src/snes/tests/ex13.c I just zero out the solution vector.
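
In code that is just (x standing in for the solution vector; a sketch):

  PetscCall(VecZeroEntries(x));   /* reset the previous solution */
  PetscCall(KSPSolve(ksp, b, x));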


> The profile for BoomerAMG also doesn't really show the V-cycle behavior of
> the other implementations. Most of the runtime seems to go into calls to
> cusparseDcsrsv which might happen at the different levels, but the runtime
> of these kernels doesn't show the V-cycle pattern. According to the output
> with -pc_hypre_boomeramg_print_statistics it is doing the right thing
> though, so I guess it is alright (and if not, this is probably the wrong
> place to discuss it).


> When using PCAMGX, I see two PCApply (each showing a nice V-cycle
> behavior) calls in KSPSolve (three for the very first KSPSolve) while
> expecting just one. Each KSPSolve should do a single preconditioned
> Richardson iteration. Why is the preconditioner applied multiple times here?
>
>
Again, not sure what "see" is, but PCAMGX is pretty new and has not been
used much.
Note some KSP methods apply to the PC before the iteration.

Mark


> Thank you,
> Paul Große-Bley
>
>
> On Monday, February 06, 2023 20:05 CET, Barry Smith 
> wrote:
>
>
>
>
>
>  It should not crash, take a look at the test cases at the bottom of the
> file. You are likely correct if the code, unfortunately, does use
> DMCreateMatrix() it will not work out of the box for geometric multigrid.
> So it might be the wrong example for you.
>
>   I don't know what you mean about clever. If you simply set the solution
> to zero at the beginning of the loop then it will just do the same solve
> multiple times. The setup should not do much of anything after the first
> solver.  Though usually solves are big enough that one need not run solves
> multiple times to get a good understanding of their performance.
>
>
>
>
>
>
>
> On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley <
> paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
> Hi Barry,
>
> src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting
> point, thank you! Sadly I get a segfault when executing that example with
> PCMG and more than one level, i.e. with the minimal args:
>
> $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
> ===
> Test: KSP performance - Poisson
> Input matrix: 27-pt finite difference stencil
> -n 100
> DoFs = 100
> Number of nonzeros = 26463592
>
> Step1  - creating Vecs and Mat...
> Step2a - running PCSetUp()...
> [0]PETSC ERROR:
> 
> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
> https://petsc.org/release/faq/
> [0]PETSC ERROR: or try
> https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA
> systems to find memory corruption errors
> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and
> run
> [0]PETSC ERROR: to get more information on the crash.
> [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is
> causing the crash.
> --
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 59.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --
>
> As 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Mark Adams
On Tue, Feb 7, 2023 at 6:40 AM Matthew Knepley  wrote:

> On Tue, Feb 7, 2023 at 6:23 AM Mark Adams  wrote:
>
>> I do one complete solve to get everything setup, to be safe.
>>
>> src/ts/tutorials/ex13.c does this and runs multiple solves, if you like
>> but one solve is probably fine.
>>
>
> I think that is SNES ex13
>

Yes, it is src/snes/tests/ex13.c


>
>   Matt
>
>
>> This was designed as a benchmark and is nice because it can do any order
>> FE solve of Poisson (uses DM/PetscFE, slow).
>> src/ksp/ksp/tutorials/ex56.c is old school, hardwired for elasticity but
>> is simpler and the setup is faster if you are doing large problems per MPI
>> process.
>>
>> Mark
>>
>> On Mon, Feb 6, 2023 at 2:06 PM Barry Smith  wrote:
>>
>>>
>>>  It should not crash, take a look at the test cases at the bottom of the
>>> file. You are likely correct if the code, unfortunately, does use
>>> DMCreateMatrix() it will not work out of the box for geometric multigrid.
>>> So it might be the wrong example for you.
>>>
>>>   I don't know what you mean about clever. If you simply set the
>>> solution to zero at the beginning of the loop then it will just do the same
>>> solve multiple times. The setup should not do much of anything after the
>>> first solver.  Though usually solves are big enough that one need not run
>>> solves multiple times to get a good understanding of their performance.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley <
>>> paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>>>
>>> Hi Barry,
>>>
>>> src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting
>>> point, thank you! Sadly I get a segfault when executing that example with
>>> PCMG and more than one level, i.e. with the minimal args:
>>>
>>> $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
>>> ===
>>> Test: KSP performance - Poisson
>>> Input matrix: 27-pt finite difference stencil
>>> -n 100
>>> DoFs = 100
>>> Number of nonzeros = 26463592
>>>
>>> Step1  - creating Vecs and Mat...
>>> Step2a - running PCSetUp()...
>>> [0]PETSC ERROR:
>>> 
>>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>>> probably memory access out of range
>>> [0]PETSC ERROR: Try option -start_in_debugger or
>>> -on_error_attach_debugger
>>> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
>>> https://petsc.org/release/faq/
>>> [0]PETSC ERROR: or try
>>> https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA
>>> systems to find memory corruption errors
>>> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
>>> and run
>>> [0]PETSC ERROR: to get more information on the crash.
>>> [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is
>>> causing the crash.
>>>
>>> --
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>> with errorcode 59.
>>>
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>>
>>> --
>>>
>>> As the matrix is not created using DMDACreate3d I expected it to fail
>>> due to the missing geometric information, but I expected it to fail more
>>> gracefully than with a segfault.
>>> I will try to combine bench_kspsolve.c with ex45.c to get easy MG
>>> preconditioning, especially since I am interested in the 7pt stencil for
>>> now.
>>>
>>> Concerning my benchmarking loop from before: Is it generally discouraged
>>> to do this for KSPSolve due to PETSc cleverly/lazily skipping some of the
>>> work when doing the same solve multiple times or are the solves not
>>> iterated in bench_kspsolve.c (while the MatMuls are with -matmult) just to
>>> keep the runtime short?
>>>
>>> Thanks,
>>> Paul
>>>
>>> On Monday, February 06, 2023 17:04 CET, Barry Smith 
>>> wrote:
>>>
>>>
>>>
>>>
>>>
>>>   Paul,
>>>
>>>I think src/ksp/ksp/tutorials/benchmark_ksp.c is the code intended to
>>> be used for simple benchmarking.
>>>
>>>You can use VecCudaGetArray() to access the GPU memory of the vector
>>> and then call a CUDA kernel to compute the right hand side vector directly
>>> on the GPU.
>>>
>>>   Barry
>>>
>>>
>>>
>>> On Feb 6, 2023, at 10:57 AM, Paul Grosse-Bley <
>>> paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>>>
>>> Hi,
>>>
>>> I want to compare different implementations of multigrid solvers for
>>> Nvidia GPUs using the poisson problem (starting from ksp tutorial example
>>> ex45.c).
>>> Therefore I am trying to get runtime results comparable to hpgmg-cuda
>>> 
>>> (finite-volume), i.e. using multiple warmup and measurement solves and
>>> 

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).
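
Roughly, the RHS setup described above looks like the sketch below (b, nlocal and
fill_rhs_on_device are placeholders for the vector, its local length and a
user-written CUDA kernel launcher; they are not PETSc routines):

  PetscScalar *d_b;
  PetscCall(VecCUDAGetArray(b, &d_b));      /* raw device pointer, no host->device copy */
  fill_rhs_on_device(d_b, nlocal);          /* user CUDA kernel fills the RHS on the GPU */
  PetscCall(VecCUDARestoreArray(b, &d_b));  /* marks the device data as up to date */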

The profiling results for PCMG and PCGAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make 
any difference. I would have expected to be able to increase the number of 
V-cycles per KSP iteration, but I keep seeing just a single V-cycle when 
changing the option (using PCMG).

When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between 
bench iterations to reset the solution vector does not seem to work as the 
residual keeps shrinking. Is this a bug? Any advice for working around this?

The profile for BoomerAMG also doesn't really show the V-cycle behavior of the 
other implementations. Most of the runtime seems to go into calls to 
cusparseDcsrsv which might happen at the different levels, but the runtime of 
these kernels doesn't show the V-cycle pattern. According to the output with 
-pc_hypre_boomeramg_print_statistics it is doing the right thing though, so I 
guess it is alright (and if not, this is probably the wrong place to discuss 
it).

When using PCAMGX, I see two PCApply (each showing a nice V-cycle behavior) 
calls in KSPSolve (three for the very first KSPSolve) while expecting just one. 
Each KSPSolve should do a single preconditioned Richardson iteration. Why is 
the preconditioner applied multiple times here?

Thank you,
Paul Große-Bley


On Monday, February 06, 2023 20:05 CET, Barry Smith  wrote:
It should not crash, take a look at the test cases at the bottom of the 
file. You are likely correct if the code, unfortunately, does use 
DMCreateMatrix() it will not work out of the box for geometric multigrid. So it 
might be the wrong example for you.   I don't know what you mean about clever. 
If you simply set the solution to zero at the beginning of the loop then it 
will just do the same solve multiple times. The setup should not do much of 
anything after the first solver.  Though usually solves are big enough that 
one need not run solves multiple times to get a good understanding of their 
performance.   On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley 
 wrote: Hi Barry,

src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, 
thank you! Sadly I get a segfault when executing that example with PCMG and 
more than one level, i.e. with the minimal args:

$ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
===
Test: KSP performance - Poisson
    Input matrix: 27-pt finite difference stencil
    -n 100
    DoFs = 100
    Number of nonzeros = 26463592

Step1  - creating Vecs and Mat...
Step2a - running PCSetUp()...
[0]PETSC ERROR: 

[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably 
memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and 
https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on 
NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing 
the crash.
--
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--

As the matrix is not created using DMDACreate3d I expected it to fail due to 
the missing geometric information, but I expected it to fail more gracefully 
than with a segfault.
I will try to combine bench_kspsolve.c with ex45.c to get easy MG 
preconditioning, especially since I am interested in the 7pt stencil for now.
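
The ex45-style glue for that would look roughly like this (grid size, ComputeMatrix
and ComputeRHS are placeholders for the 7-pt stencil callbacks as in ex45.c; ksp
assumed created already):

  DM da;
  PetscCall(DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                         DM_BOUNDARY_NONE, DMDA_STENCIL_STAR, 129, 129, 129,
                         PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE, 1, 1,
                         NULL, NULL, NULL, &da));
  PetscCall(DMSetFromOptions(da));
  PetscCall(DMSetUp(da));
  PetscCall(KSPSetDM(ksp, da));   /* gives -pc_type mg the geometry it needs to coarsen */
  PetscCall(KSPSetComputeOperators(ksp, ComputeMatrix, NULL));
  PetscCall(KSPSetComputeRHS(ksp, ComputeRHS, NULL));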

Concerning my benchmarking loop from before: Is it generally discouraged to do 
this for KSPSolve due to PETSc cleverly/lazily skipping some of the work when 
doing the same solve multiple times or are the solves not iterated in 
bench_kspsolve.c (while the MatMuls are with -matmult) just to keep the runtime 
short?

Thanks,
Paul

On Monday, February 06, 2023 17:04 CET, Barry Smith  wrote:
 Paul,    I think