> On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
> Hi Mark,
>
> I use Nvidia Nsight Systems with --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers that come with -log_view, i.e. I get a nice view of all cuBLAS and cuSPARSE calls (in addition to the actual kernels, which are not always easy to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed NVTX markers.
>
> The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because kernel runtimes on coarser levels are much shorter. At the coarsest level there normally isn't even enough work for the GPU (Nvidia A100) to be fully occupied, which is also visible in Nsight Systems.
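For anyone reproducing this setup, the invocation described above should look roughly like the following (the executable name and solver options are just placeholders for whatever is being benchmarked):

    nsys profile --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose ./bench_kspsolve -pc_type mg -log_view -pc_mg_log

where -log_view provides the NVTX ranges mentioned above and -pc_mg_log adds the more detailed per-level markers for PCMG.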
Hmm, I run an example with -pc_mg_multiplicative_cycles 2 and it most definitely changes the run. I do not understand why it would not work for you. If you run with and without the option, are the exact same counts listed for all the events in the -log_view output?

> I run only a single MPI rank with a single GPU, so profiling is straightforward.
>
> Best,
> Paul Große-Bley
>
> On Wednesday, February 22, 2023 18:24 CET, Mark Adams <mfad...@lbl.gov> wrote:
>
>> On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>>> Hi Barry,
>>>
>>> After using VecCUDAGetArray to initialize the RHS, that kernel still gets called as part of KSPSolve instead of KSPSetUp, but its runtime is far less significant than the cudaMemcpy before, so I guess I will leave it like this. Other than that I kept the code as in my first message in this thread (as you wrote, benchmark_ksp.c is not well suited for PCMG).
>>>
>>> The profiling results for PCMG and PCAMG look as I would expect them to, i.e. one can nicely see the GPU load/kernel runtimes going down and up again for each V-cycle.
>>>
>>> I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make any difference. I would have expected to be able to increase the number of V-cycles per KSP iteration, but I keep seeing just a single V-cycle when changing the option (using PCMG).
>>
>> How are you seeing this?
>> You might try -log_trace to see if you get two V-cycles.
>>
>>> When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between bench iterations to reset the solution vector does not seem to work, as the residual keeps shrinking. Is this a bug? Any advice for working around this?
>>
>> Looking at the doc https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you use this with KSPSetComputeRHS.
>>
>> In src/snes/tests/ex13.c I just zero out the solution vector.
>>
>>> The profile for BoomerAMG also doesn't really show the V-cycle behavior of the other implementations. Most of the runtime seems to go into calls to cusparseDcsrsv, which might happen at the different levels, but the runtime of these kernels doesn't show the V-cycle pattern. According to the output with -pc_hypre_boomeramg_print_statistics it is doing the right thing though, so I guess it is alright (and if not, this is probably the wrong place to discuss it).
>>>
>>> When using PCAMGX, I see two PCApply calls (each showing a nice V-cycle behavior) in KSPSolve (three for the very first KSPSolve) while expecting just one. Each KSPSolve should do a single preconditioned Richardson iteration. Why is the preconditioner applied multiple times here?
>>
>> Again, not sure what "see" is, but PCAMGX is pretty new and has not been used much.
>> Note some KSP methods apply the PC before the iteration.
>>
>> Mark
>>
>>> Thank you,
>>> Paul Große-Bley
>>>
>>> On Monday, February 06, 2023 20:05 CET, Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>> It should not crash; take a look at the test cases at the bottom of the file. You are likely correct: if the code, unfortunately, does use DMCreateMatrix(), it will not work out of the box for geometric multigrid, so it might be the wrong example for you.
>>>
>>> I don't know what you mean about clever.
>>> If you simply set the solution to zero at the beginning of the loop then it will just do the same solve multiple times. The setup should not do much of anything after the first solve. Though usually solves are big enough that one need not run them multiple times to get a good understanding of their performance.
>>>
>>>> On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>>>>
>>>> Hi Barry,
>>>>
>>>> src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, thank you! Sadly I get a segfault when executing that example with PCMG and more than one level, i.e. with the minimal args:
>>>>
>>>> $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
>>>> ===========================================
>>>> Test: KSP performance - Poisson
>>>> Input matrix: 27-pt finite difference stencil
>>>> -n 100
>>>> DoFs = 1000000
>>>> Number of nonzeros = 26463592
>>>>
>>>> Step1  - creating Vecs and Mat...
>>>> Step2a - running PCSetUp()...
>>>> [0]PETSC ERROR: ------------------------------------------------------------------------
>>>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>>>> [0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
>>>> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
>>>> [0]PETSC ERROR: to get more information on the crash.
>>>> [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
>>>> --------------------------------------------------------------------------
>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>> with errorcode 59.
>>>>
>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>> You may or may not see output from other processes, depending on
>>>> exactly when Open MPI kills them.
>>>> --------------------------------------------------------------------------
>>>>
>>>> As the matrix is not created using DMDACreate3d, I expected it to fail due to the missing geometric information, but I expected it to fail more gracefully than with a segfault.
>>>> I will try to combine bench_kspsolve.c with ex45.c to get easy MG preconditioning, especially since I am interested in the 7pt stencil for now.
>>>>
>>>> Concerning my benchmarking loop from before: is it generally discouraged to do this for KSPSolve because PETSc cleverly/lazily skips some of the work when doing the same solve multiple times, or are the solves simply not iterated in bench_kspsolve.c (while the MatMults are with -matmult) just to keep the runtime short?
>>>>
>>>> Thanks,
>>>> Paul
>>>>
>>>> On Monday, February 06, 2023 17:04 CET, Barry Smith <bsm...@petsc.dev> wrote:
>>>>
>>>> Paul,
>>>>
>>>> I think src/ksp/ksp/tutorials/benchmark_ksp.c is the code intended to be used for simple benchmarking.
>>>>
>>>> You can use VecCUDAGetArray() to access the GPU memory of the vector and then call a CUDA kernel to compute the right-hand-side vector directly on the GPU.
>>>>
>>>> Barry
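To make that suggestion concrete, a rough sketch of filling the right-hand side directly in device memory might look like the following (compiled as a .cu file; the kernel body, launch configuration and the names FillRHS / ComputeRHSOnDevice are only illustrative, and it assumes a CUDA-enabled PETSc build -- depending on the PETSc version the VecCUDA* declarations come from petscvec.h or petscdevice_cuda.h):

#include <petscvec.h>
#include <petscdevice_cuda.h> /* PetscCallCUDA(); adjust includes to your PETSc version */

/* illustrative kernel; replace the body with the actual right-hand-side formula */
__global__ void FillRHS(PetscScalar *b, PetscInt n)
{
  PetscInt i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) b[i] = 1.0;
}

PetscErrorCode ComputeRHSOnDevice(Vec b)
{
  PetscScalar *barray;
  PetscInt     n;

  PetscFunctionBeginUser;
  PetscCall(VecGetLocalSize(b, &n));
  PetscCall(VecCUDAGetArrayWrite(b, &barray));     /* raw device pointer; no host-to-device copy */
  FillRHS<<<(n + 255) / 256, 256>>>(barray, n);
  PetscCallCUDA(cudaGetLastError());               /* check that the kernel launch succeeded */
  PetscCall(VecCUDARestoreArrayWrite(b, &barray)); /* marks the device data as current */
  PetscFunctionReturn(0);
}

With the RHS produced this way, the vector never has to be populated on the host, which is the point of the suggestion above.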
>>>>> On Feb 6, 2023, at 10:57 AM, Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I want to compare different implementations of multigrid solvers for Nvidia GPUs using the Poisson problem (starting from KSP tutorial example ex45.c).
>>>>> Therefore I am trying to get runtime results comparable to hpgmg-cuda <https://bitbucket.org/nsakharnykh/hpgmg-cuda/src/master/> (finite-volume), i.e. using multiple warmup and measurement solves and avoiding measuring setup time.
>>>>> For now I am using -log_view with added stages:
>>>>>
>>>>> PetscLogStageRegister("Solve Bench", &solve_bench_stage);
>>>>> for (int i = 0; i < BENCH_SOLVES; i++) {
>>>>>   PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x
>>>>>   PetscCall(KSPSetUp(ksp)); // try to avoid setup overhead during solve
>>>>>   PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that everything is done
>>>>>
>>>>>   PetscLogStagePush(solve_bench_stage);
>>>>>   PetscCall(KSPSolve(ksp, NULL, NULL));
>>>>>   PetscLogStagePop();
>>>>> }
>>>>>
>>>>> This snippet is preceded by a similar loop for warmup.
>>>>>
>>>>> When profiling this using Nsight Systems, I see that the very first solve is much slower, which mostly corresponds to H2D (host-to-device) copies and e.g. cuBLAS setup (maybe also paging overheads as mentioned in the docs <https://petsc.org/release/docs/manual/profiling/#accurate-profiling-and-paging-overheads>, but probably insignificant in this case). The following solves have some overhead at the start from an H2D copy of a vector (the RHS I guess, as the copy is preceded by a matrix-vector product) in the first MatResidual call (call chain: KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My interpretation of the profiling results (i.e. the cuBLAS calls) is that that vector is overwritten with the residual in Daxpy and therefore has to be copied again for the next iteration.
>>>>>
>>>>> Is there an elegant way of avoiding that H2D copy? I have seen some examples on constructing matrices directly on the GPU, but nothing about vectors. Any further tips for benchmarking (vs. profiling) PETSc solvers?
>>>>> At the moment I am using Jacobi as the smoother, but I would like to have a CUDA implementation of SOR instead. Is there a good way of achieving that, e.g. using PCHYPRE's BoomerAMG with a single level and the "SOR/Jacobi" smoother as the smoother in PCMG? Or is the overhead from constantly switching between PETSc and hypre too big?
>>>>>
>>>>> Thanks,
>>>>> Paul
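As a possible follow-up on resetting the solution between solves: a variant of the loop above that simply zeroes the solution vector (as suggested elsewhere in this thread for src/snes/tests/ex13.c), instead of re-attaching the initial-guess callback each time, might look roughly like this. Here ksp, dctx, BENCH_SOLVES and solve_bench_stage are the same as in the snippet above; whether this also resets the BoomerAMG case is something to verify.

PetscCall(PetscLogStageRegister("Solve Bench", &solve_bench_stage));
for (int i = 0; i < BENCH_SOLVES; i++) {
  Vec x;

  PetscCall(KSPSetUp(ksp));                       // keep setup work out of the timed stage
  PetscCall(KSPGetSolution(ksp, &x));             // the vector KSPSolve(ksp, NULL, NULL) solves into
  PetscCall(VecSet(x, 0.0));                      // reset the solution before the next timed solve
  PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure all previous GPU work is done

  PetscLogStagePush(solve_bench_stage);
  PetscCall(KSPSolve(ksp, NULL, NULL));
  PetscLogStagePop();
}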