Paul,

   I think src/ksp/ksp/tutorials/benchmark_ksp.c is the code intended to be 
used for simple benchmarking. 

   You can use VecCUDAGetArray() to access the GPU memory of the vector and 
then call a CUDA kernel to compute the right-hand-side vector directly on the 
GPU.
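
   A minimal sketch of that approach (the FillRHS kernel and the 
ComputeRHSOnDevice wrapper are hypothetical names here, and the file has to 
be compiled as CUDA, e.g. as a .cu file):

#include <petscvec.h>
#include <petscdevice_cuda.h>

/* Hypothetical kernel: fill the right-hand side directly on the GPU. */
__global__ void FillRHS(PetscScalar *b, PetscInt n)
{
  PetscInt i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) b[i] = 1.0; /* replace with the actual RHS function */
}

static PetscErrorCode ComputeRHSOnDevice(Vec b)
{
  PetscScalar *barray;
  PetscInt     n;

  PetscFunctionBeginUser;
  PetscCall(VecGetLocalSize(b, &n));
  /* Get the device pointer (copies host data up only if the host copy is
     newer); after the restore, PETSc considers the GPU copy up to date. */
  PetscCall(VecCUDAGetArray(b, &barray));
  FillRHS<<<(n + 255) / 256, 256>>>(barray, n);
  PetscCallCUDA(cudaGetLastError());
  PetscCall(VecCUDARestoreArray(b, &barray));
  PetscFunctionReturn(0);
}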

  Barry


> On Feb 6, 2023, at 10:57 AM, Paul Grosse-Bley 
> <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
> 
> Hi,
> 
> I want to compare different implementations of multigrid solvers for Nvidia 
> GPUs using the Poisson problem (starting from KSP tutorial example ex45.c). 
> Therefore I am trying to get runtime results comparable to hpgmg-cuda 
> <https://bitbucket.org/nsakharnykh/hpgmg-cuda/src/master/> (finite-volume), 
> i.e. using multiple warmup and measurement solves and avoiding measuring 
> setup time.
> For now I am using -log_view with added stages:
> 
> PetscCall(PetscLogStageRegister("Solve Bench", &solve_bench_stage));
> for (int i = 0; i < BENCH_SOLVES; i++) {
>   PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x
>   PetscCall(KSPSetUp(ksp));                       // try to avoid setup overhead during solve
>   PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that everything is done
>
>   PetscCall(PetscLogStagePush(solve_bench_stage));
>   PetscCall(KSPSolve(ksp, NULL, NULL));
>   PetscCall(PetscLogStagePop());
> }
> 
> This snippet is preceded by a similar loop for warmup.
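> 
> The warmup loop is essentially the same, minus the logging stage, i.e. 
> something like this (with WARMUP_SOLVES as the analogous count):
> 
> for (int i = 0; i < WARMUP_SOLVES; i++) {
>   PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL));
>   PetscCall(KSPSetUp(ksp));
>   PetscCall(KSPSolve(ksp, NULL, NULL));
> }
> PetscCall(PetscDeviceContextSynchronize(dctx));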
> 
> When profiling this using Nsight Systems, I see that the very first solve is 
> much slower, which mostly corresponds to H2D (host-to-device) copies and e.g. 
> cuBLAS setup (maybe also paging overheads as mentioned in the docs 
> <https://petsc.org/release/docs/manual/profiling/#accurate-profiling-and-paging-overheads>,
>  but probably insignificant in this case). The following solves have some 
> overhead at the start from an H2D copy of a vector (the RHS, I guess, as the 
> copy is preceded by a matrix-vector product) in the first MatResidual call 
> (call chain: KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). 
> My interpretation of the profiling results (i.e. the cuBLAS calls) is that 
> this vector is overwritten with the residual in Daxpy and therefore has to be 
> copied again for the next iteration.
> 
> Is there an elegant way of avoiding that H2D copy? I have seen some examples 
> on constructing matrices directly on the GPU, but nothing about vectors. Any 
> further tips for benchmarking (vs. profiling) PETSc solvers? At the moment I 
> am using Jacobi as the smoother, but I would like to have a CUDA 
> implementation of SOR instead. Is there a good way of achieving that, e.g. 
> using PCHYPRE's BoomerAMG with a single level and its "SOR/Jacobi" relaxation 
> as the smoother in PCMG (sketched below)? Or is the overhead from constantly 
> switching between PETSc and hypre too big?
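> 
> What I had in mind is something along these lines (untested):
> 
> -pc_type mg
> -mg_levels_ksp_type richardson
> -mg_levels_pc_type hypre
> -mg_levels_pc_hypre_type boomeramg
> -mg_levels_pc_hypre_boomeramg_max_levels 1
> -mg_levels_pc_hypre_boomeramg_relax_type_all SOR/Jacobi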
> 
> Thanks,
> Paul
