Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

Paul Grosse-Bley Wed, 22 Feb 2023 10:10:50 -0800

Hi Mark,

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.


The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occupied which is also visible in Nsight Systems.

I run only a single MPI rank with a single GPU, so profiling is straighforward.

Best,
Paul Große-Bley

On Wednesday, February 22, 2023 18:24 CET, Mark Adams <mfad...@lbl.gov> wrote:
   On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
<paul.grosse-b...@ziti.uni-heidelberg.de> wrote:Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).

The profiling results for PCMG and PCAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make 
any difference. I would have expected to be able to increase the number of 
V-cycles per KSP iteration, but I keep seeing just a single V-cycle when 
changing the option (using PCMG). How are you seeing this? You might try 
-log_trace to see if you get two V cycles. 
When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between 
bench iterations to reset the solution vector does not seem to work as the 
residual keeps shrinking. Is this a bug? Any advice for working around this?
  Looking at the doc 
https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you 
use this with  KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the 
solution vector.  The profile for BoomerAMG also doesn't really show the 
V-cycle behavior of the other implementations. Most of the runtime seems to go 
into calls to cusparseDcsrsv which might happen at the different levels, but 
the runtime of these kernels doesn't show the V-cycle pattern. According to the 
output with -pc_hypre_boomeramg_print_statistics it is doing the right thing 
though, so I guess it is alright (and if not, this is probably the wrong place 
to discuss it).
When using PCAMGX, I see two PCApply (each showing a nice V-cycle behavior) 
calls in KSPSolve (three for the very first KSPSolve) while expecting just one. 
Each KSPSolve should do a single preconditioned Richardson iteration. Why is 
the preconditioner applied multiple times here?
  Again, not sure what "see" is, but PCAMGX is pretty new and has not been used 
much.Note some KSP methods apply to the PC before the iteration. Mark  Thank 
you,
Paul Große-Bley


On Monday, February 06, 2023 20:05 CET, Barry Smith <bsm...@petsc.dev> wrote:
    It should not crash, take a look at the test cases at the bottom of the 
file. You are likely correct if the code, unfortunately, does use 
DMCreateMatrix() it will not work out of the box for geometric multigrid. So it 
might be the wrong example for you.   I don't know what you mean about clever. 
If you simply set the solution to zero at the beginning of the loop then it 
will just do the same solve multiple times. The setup should not do much of 
anything after the first solver.  Thought usually solves are big enough that 
one need not run solves multiple times to get a good understanding of their 
performance.       On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley 
<paul.grosse-b...@ziti.uni-heidelberg.de> wrote: Hi Barry,

src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, 
thank you! Sadly I get a segfault when executing that example with PCMG and 
more than one level, i.e. with the minimal args:

$ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
===========================================
Test: KSP performance - Poisson
    Input matrix: 27-pt finite difference stencil
    -n 100
    DoFs = 1000000
    Number of nonzeros = 26463592

Step1  - creating Vecs and Mat...
Step2a - running PCSetUp()...
[0]PETSC ERROR: 
------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably 
memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and 
https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on 
NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing 
the crash.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

As the matrix is not created using DMDACreate3d I expected it to fail due to 
the missing geometric information, but I expected it to fail more gracefully 
than with a segfault.
I will try to combine bench_kspsolve.c with ex45.c to get easy MG 
preconditioning, especially since I am interested in the 7pt stencil for now.

Concerning my benchmarking loop from before: Is it generally discouraged to do 
this for KSPSolve due to PETSc cleverly/lazily skipping some of the work when 
doing the same solve multiple times or are the solves not iterated in 
bench_kspsolve.c (while the MatMuls are with -matmult) just to keep the runtime 
short?

Thanks,
Paul

On Monday, February 06, 2023 17:04 CET, Barry Smith <bsm...@petsc.dev> wrote:
     Paul,    I think src/ksp/ksp/tutorials/benchmark_ksp.c is the code 
intended to be used for simple benchmarking.     You can use VecCudaGetArray() 
to access the GPU memory of the vector and then call a CUDA kernel to compute 
the right hand side vector directly on the GPU.   Barry  On Feb 6, 2023, at 
10:57 AM, Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote: Hi,

I want to compare different implementations of multigrid solvers for Nvidia 
GPUs using the poisson problem (starting from ksp tutorial example ex45.c).
Therefore I am trying to get runtime results comparable to hpgmg-cuda 
(finite-volume), i.e. using multiple warmup and measurement solves and avoiding 
measuring setup time.
For now I am using -log_view with added stages:

PetscLogStageRegister("Solve Bench", &solve_bench_stage);
  for (int i = 0; i < BENCH_SOLVES; i++) {
    PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // 
reset x
    PetscCall(KSPSetUp(ksp)); // try to avoid setup overhead during solve
    PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that 
everything is done

    PetscLogStagePush(solve_bench_stage);
    PetscCall(KSPSolve(ksp, NULL, NULL));
    PetscLogStagePop();
  }

This snippet is preceded by a similar loop for warmup.

When profiling this using Nsight Systems, I see that the very first solve is 
much slower which mostly correspods to H2D (host to device) copies and e.g. 
cuBLAS setup (maybe also paging overheads as mentioned in the docs, but 
probably insignificant in this case). The following solves have some overhead 
at the start from a H2D copy of a vector (the RHS I guess, as the copy is 
preceeded by a matrix-vector product) in the first MatResidual call (callchain: 
KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My 
interpretation of the profiling results (i.e. cuBLAS calls) is that that vector 
is overwritten with the residual in Daxpy and therefore has to be copied again 
for the next iteration.

Is there an elegant way of avoiding that H2D copy? I have seen some examples on 
constructing matrices directly on the GPU, but nothing about vectors. Any 
further tips for benchmarking (vs profiling) PETSc solvers? At the moment I am 
using jacobi as smoother, but I would like to have a CUDA implementation of SOR 
instead. Is there a good way of achieving that, e.g. using PCHYPREs boomeramg 
with a single level and "SOR/Jacobi"-smoother  as smoother in PCMG? Or is the 
overhead from constantly switching between PETSc and hypre too big?

Thanks,
Paul

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

Reply via email to