> On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
> Hi Mark,
>
> I use Nvidia Nsight Systems with --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers that come with -log_view, i.e. I get a nice view of all cuBLAS and cuSPARSE calls (in addition to the actual kernels, which are not always easy to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed NVTX markers.
>
> The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because kernel runtimes on coarser levels are much shorter. At the coarsest level there normally isn't even enough work for the GPU (Nvidia A100) to be fully occupied, which is also visible in Nsight Systems.
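For anyone reproducing this setup, the invocation described above should look roughly like the following (the executable name and solver options are just placeholders for whatever is being benchmarked):

    nsys profile --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose ./bench_kspsolve -pc_type mg -log_view -pc_mg_log

where -log_view provides the NVTX ranges mentioned above and -pc_mg_log adds the more detailed per-level markers for PCMG.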
Hmm, I run an example with -pc_mg_multiplicative_cycles 2 and it most definitely changes the run. I do not understand why it would not work for you. If you run with and without the option, are the exact same counts listed for all the events in the -log_view output?

> I run only a single MPI rank with a single GPU, so profiling is straightforward.
>
> Best,
> Paul Große-Bley
>
> On Wednesday, February 22, 2023 18:24 CET, Mark Adams <mfad...@lbl.gov> wrote:
>
>> On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>>> Hi Barry,
>>>
>>> After using VecCUDAGetArray to initialize the RHS, that kernel still gets called as part of KSPSolve instead of KSPSetUp, but its runtime is far less significant than the cudaMemcpy before, so I guess I will leave it like this. Other than that I kept the code as in my first message in this thread (as you wrote, benchmark_ksp.c is not well suited for PCMG).
>>>
>>> The profiling results for PCMG and PCAMG look as I would expect them to, i.e. one can nicely see the GPU load/kernel runtimes going down and up again for each V-cycle.
>>>
>>> I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make any difference. I would have expected to be able to increase the number of V-cycles per KSP iteration, but I keep seeing just a single V-cycle when changing the option (using PCMG).
>>
>> How are you seeing this?
>> You might try -log_trace to see if you get two V-cycles.
>>
>>> When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between bench iterations to reset the solution vector does not seem to work, as the residual keeps shrinking. Is this a bug? Any advice for working around this?
>>
>> Looking at the doc https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you use this with KSPSetComputeRHS.
>>
>> In src/snes/tests/ex13.c I just zero out the solution vector.
>>
>>> The profile for BoomerAMG also doesn't really show the V-cycle behavior of the other implementations. Most of the runtime seems to go into calls to cusparseDcsrsv, which might happen at the different levels, but the runtime of these kernels doesn't show the V-cycle pattern. According to the output with -pc_hypre_boomeramg_print_statistics it is doing the right thing though, so I guess it is alright (and if not, this is probably the wrong place to discuss it).
>>>
>>> When using PCAMGX, I see two PCApply calls (each showing a nice V-cycle behavior) in KSPSolve (three for the very first KSPSolve) while expecting just one. Each KSPSolve should do a single preconditioned Richardson iteration. Why is the preconditioner applied multiple times here?
>>
>> Again, not sure what "see" is, but PCAMGX is pretty new and has not been used much.
>> Note some KSP methods apply the PC before the iteration.
>>
>> Mark
>>
>>> Thank you,
>>> Paul Große-Bley
>>>
>>> On Monday, February 06, 2023 20:05 CET, Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>> It should not crash; take a look at the test cases at the bottom of the file. You are likely correct: if the code, unfortunately, does use DMCreateMatrix(), it will not work out of the box for geometric multigrid, so it might be the wrong example for you.
>>>
>>> I don't know what you mean about clever.
>>> If you simply set the solution to zero at the beginning of the loop then it will just do the same solve multiple times. The setup should not do much of anything after the first solve. Though usually solves are big enough that one need not run them multiple times to get a good understanding of their performance.
>>>
>>>> On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>>>>
>>>> Hi Barry,
>>>>
>>>> src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, thank you! Sadly I get a segfault when executing that example with PCMG and more than one level, i.e. with the minimal args:
>>>>
>>>> $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
>>>> ===========================================
>>>> Test: KSP performance - Poisson
>>>> Input matrix: 27-pt finite difference stencil
>>>> -n 100
>>>> DoFs = 1000000
>>>> Number of nonzeros = 26463592
>>>>
>>>> Step1  - creating Vecs and Mat...
>>>> Step2a - running PCSetUp()...
>>>> [0]PETSC ERROR: ------------------------------------------------------------------------
>>>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>>>> [0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
>>>> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
>>>> [0]PETSC ERROR: to get more information on the crash.
>>>> [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
>>>> --------------------------------------------------------------------------
>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>> with errorcode 59.
>>>>
>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>> You may or may not see output from other processes, depending on
>>>> exactly when Open MPI kills them.
>>>> --------------------------------------------------------------------------
>>>>
>>>> As the matrix is not created using DMDACreate3d, I expected it to fail due to the missing geometric information, but I expected it to fail more gracefully than with a segfault.
>>>> I will try to combine bench_kspsolve.c with ex45.c to get easy MG preconditioning, especially since I am interested in the 7pt stencil for now.
>>>>
>>>> Concerning my benchmarking loop from before: is it generally discouraged to do this for KSPSolve because PETSc cleverly/lazily skips some of the work when doing the same solve multiple times, or are the solves simply not iterated in bench_kspsolve.c (while the MatMults are with -matmult) just to keep the runtime short?
>>>>
>>>> Thanks,
>>>> Paul
>>>>
>>>> On Monday, February 06, 2023 17:04 CET, Barry Smith <bsm...@petsc.dev> wrote:
>>>>
>>>> Paul,
>>>>
>>>> I think src/ksp/ksp/tutorials/benchmark_ksp.c is the code intended to be used for simple benchmarking.
>>>>
>>>> You can use VecCUDAGetArray() to access the GPU memory of the vector and then call a CUDA kernel to compute the right-hand-side vector directly on the GPU.
>>>>
>>>> Barry
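To make that suggestion concrete, a rough sketch of filling the right-hand side directly in device memory might look like the following (compiled as a .cu file; the kernel body, launch configuration and the names FillRHS / ComputeRHSOnDevice are only illustrative, and it assumes a CUDA-enabled PETSc build -- depending on the PETSc version the VecCUDA* declarations come from petscvec.h or petscdevice_cuda.h):

#include <petscvec.h>
#include <petscdevice_cuda.h> /* PetscCallCUDA(); adjust includes to your PETSc version */

/* illustrative kernel; replace the body with the actual right-hand-side formula */
__global__ void FillRHS(PetscScalar *b, PetscInt n)
{
  PetscInt i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) b[i] = 1.0;
}

PetscErrorCode ComputeRHSOnDevice(Vec b)
{
  PetscScalar *barray;
  PetscInt     n;

  PetscFunctionBeginUser;
  PetscCall(VecGetLocalSize(b, &n));
  PetscCall(VecCUDAGetArrayWrite(b, &barray));     /* raw device pointer; no host-to-device copy */
  FillRHS<<<(n + 255) / 256, 256>>>(barray, n);
  PetscCallCUDA(cudaGetLastError());               /* check that the kernel launch succeeded */
  PetscCall(VecCUDARestoreArrayWrite(b, &barray)); /* marks the device data as current */
  PetscFunctionReturn(0);
}

With the RHS produced this way, the vector never has to be populated on the host, which is the point of the suggestion above.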
>>>>> On Feb 6, 2023, at 10:57 AM, Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I want to compare different implementations of multigrid solvers for Nvidia GPUs using the Poisson problem (starting from KSP tutorial example ex45.c).
>>>>> Therefore I am trying to get runtime results comparable to hpgmg-cuda <https://bitbucket.org/nsakharnykh/hpgmg-cuda/src/master/> (finite-volume), i.e. using multiple warmup and measurement solves and avoiding measuring setup time.
>>>>> For now I am using -log_view with added stages:
>>>>>
>>>>> PetscLogStageRegister("Solve Bench", &solve_bench_stage);
>>>>> for (int i = 0; i < BENCH_SOLVES; i++) {
>>>>>   PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x
>>>>>   PetscCall(KSPSetUp(ksp)); // try to avoid setup overhead during solve
>>>>>   PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that everything is done
>>>>>
>>>>>   PetscLogStagePush(solve_bench_stage);
>>>>>   PetscCall(KSPSolve(ksp, NULL, NULL));
>>>>>   PetscLogStagePop();
>>>>> }
>>>>>
>>>>> This snippet is preceded by a similar loop for warmup.
>>>>>
>>>>> When profiling this using Nsight Systems, I see that the very first solve is much slower, which mostly corresponds to H2D (host-to-device) copies and e.g. cuBLAS setup (maybe also paging overheads as mentioned in the docs <https://petsc.org/release/docs/manual/profiling/#accurate-profiling-and-paging-overheads>, but probably insignificant in this case). The following solves have some overhead at the start from an H2D copy of a vector (the RHS I guess, as the copy is preceded by a matrix-vector product) in the first MatResidual call (call chain: KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My interpretation of the profiling results (i.e. the cuBLAS calls) is that that vector is overwritten with the residual in Daxpy and therefore has to be copied again for the next iteration.
>>>>>
>>>>> Is there an elegant way of avoiding that H2D copy? I have seen some examples on constructing matrices directly on the GPU, but nothing about vectors. Any further tips for benchmarking (vs. profiling) PETSc solvers?
>>>>> At the moment I am using Jacobi as the smoother, but I would like to have a CUDA implementation of SOR instead. Is there a good way of achieving that, e.g. using PCHYPRE's BoomerAMG with a single level and the "SOR/Jacobi" smoother as the smoother in PCMG? Or is the overhead from constantly switching between PETSc and hypre too big?
>>>>>
>>>>> Thanks,
>>>>> Paul
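As a possible follow-up on resetting the solution between solves: a variant of the loop above that simply zeroes the solution vector (as suggested elsewhere in this thread for src/snes/tests/ex13.c), instead of re-attaching the initial-guess callback each time, might look roughly like this. Here ksp, dctx, BENCH_SOLVES and solve_bench_stage are the same as in the snippet above; whether this also resets the BoomerAMG case is something to verify.

PetscCall(PetscLogStageRegister("Solve Bench", &solve_bench_stage));
for (int i = 0; i < BENCH_SOLVES; i++) {
  Vec x;

  PetscCall(KSPSetUp(ksp));                       // keep setup work out of the timed stage
  PetscCall(KSPGetSolution(ksp, &x));             // the vector KSPSolve(ksp, NULL, NULL) solves into
  PetscCall(VecSet(x, 0.0));                      // reset the solution before the next timed solve
  PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure all previous GPU work is done

  PetscLogStagePush(solve_bench_stage);
  PetscCall(KSPSolve(ksp, NULL, NULL));
  PetscLogStagePop();
}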