Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Matthew Knepley
On Wed, Feb 22, 2023 at 5:45 PM Paul Grosse-Bley < paul.grosse-b...@ziti.uni-heidelberg.de> wrote: > I thought to have observed the number of cycles > with -pc_mg_multiplicative_cycles to be dependent on rtol. But I might have > seen this with maxits=0 which would explain my misunderstanding of

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley
I thought I had observed that the number of cycles with -pc_mg_multiplicative_cycles depends on rtol. But I might have seen this with maxits=0, which would explain my misunderstanding of Richardson. I guess PCAMGX does not use this PCApplyRichardson_MG (yet?), because I still see the

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Barry Smith
Preonly means exactly one application of the PC, so it will never converge by itself unless the PC is a full solver. Note there is a PCApplyRichardson_MG() that gets used automatically with KSPRICHARDSON. This does not have an "extra" application of the preconditioner, so 2 iterations of
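
To make the preonly/Richardson distinction concrete, here is a small self-contained C sketch (plain C, not PETSc code). Assumptions: the matrix is the 1D Poisson matrix tridiag(-1, 2, -1), a weighted Jacobi step stands in for the multigrid preconditioner B, and all function names are hypothetical. From a zero initial guess, one preconditioned Richardson iteration x_{k+1} = x_k + B (b - A x_k) gives exactly x_1 = B b, the same result as a single preconditioner application ("preonly"); only further iterations reduce the residual.

```c
#define N 15 /* interior points of a small 1D Poisson test problem (illustrative) */

/* y = A x for A = tridiag(-1, 2, -1) with zero Dirichlet boundary values. */
static void apply_A(const double *x, double *y) {
  for (int i = 0; i < N; i++) {
    double left = (i > 0) ? x[i - 1] : 0.0;
    double right = (i < N - 1) ? x[i + 1] : 0.0;
    y[i] = 2.0 * x[i] - left - right;
  }
}

/* Stand-in preconditioner B (assumption): weighted Jacobi, z = omega * D^{-1} r with D = 2I.
   In the thread B is one multigrid cycle; any fixed linear B shows the same relationship. */
static void apply_B(const double *r, double *z) {
  const double omega = 2.0 / 3.0;
  for (int i = 0; i < N; i++) z[i] = omega * r[i] / 2.0;
}

/* "preonly": exactly one application of the preconditioner, x = B b. */
static void solve_preonly(const double *b, double *x) { apply_B(b, x); }

/* Preconditioned Richardson from a zero initial guess:
   x_{k+1} = x_k + B (b - A x_k), for a fixed number of iterations. */
static void solve_richardson(const double *b, double *x, int its) {
  double Ax[N], r[N], z[N];
  for (int i = 0; i < N; i++) x[i] = 0.0;
  for (int k = 0; k < its; k++) {
    apply_A(x, Ax);
    for (int i = 0; i < N; i++) r[i] = b[i] - Ax[i];
    apply_B(r, z);
    for (int i = 0; i < N; i++) x[i] += z[i];
  }
}

/* Squared 2-norm of the residual b - A x. */
static double residual_norm2(const double *b, const double *x) {
  double Ax[N], s = 0.0;
  apply_A(x, Ax);
  for (int i = 0; i < N; i++) s += (b[i] - Ax[i]) * (b[i] - Ax[i]);
  return s;
}
```

Replacing apply_B with a real V-cycle does not change the first observation: with x_0 = 0, one Richardson step and one preonly application coincide.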

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Matthew Knepley
On Wed, Feb 22, 2023 at 4:57 PM Paul Grosse-Bley < paul.grosse-b...@ziti.uni-heidelberg.de> wrote: > Hi again, > > I now found out that > > 1. preonly ignores -ksp_pc_side right (makes sense, I guess). > 2. richardson is incompatible with -ksp_pc_side right. > 3. preonly gives less output for

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley
Hi again, I now found out that
1. preonly ignores -ksp_pc_side right (makes sense, I guess).
2. richardson is incompatible with -ksp_pc_side right.
3. preonly gives less output for -log_view -pc_mg_log than richardson.
4. preonly also ignores -ksp_rtol etc.
5. preonly causes -log_view to

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley
I was using the Richardson KSP type, which I guess has the same behavior as GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use preonly from now on, where maxits=1 does what I want it to do. Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I think

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley
Hi Barry, the picture keeps getting clearer. I did not use KSPSetInitialGuessNonzero or the corresponding option, but using KSPSetComputeInitialGuess probably sets it automatically (without saying so in the output of -help). I was also confused by the preonly KSP type not working, which is

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Barry Smith
> On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley > wrote: > > Hi Barry, > > I think most of my "weird" observations came from the fact that I looked at > iterations of KSPSolve where the residual was already converged. PCMG and > PCGAMG do one V-cycle before even taking a look at the

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley
Hi Barry, I think most of my "weird" observations came from the fact that I looked at iterations of KSPSolve where the residual was already converged. PCMG and PCGAMG do one V-cycle before even taking a look at the residual and then, independent of -pc_mg_multiplicative_cycles, stop if it is
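
Paul's explanation can be illustrated with a toy loop. The C sketch below is hypothetical and is not how PCMG or PCGAMG are implemented: one "cycle" is a damped Jacobi update standing in for a V-cycle, and a flag decides whether the residual is tested before or after the first cycle. Started from an initial guess that already solves the system (as happens when a repeated solve reuses the previous solution as a nonzero initial guess), the check-first loop does no work, while the apply-first loop still performs one cycle.

```c
#include <stdbool.h>

#define N 15 /* illustrative problem size */

/* r = b - A x for A = tridiag(-1, 2, -1); returns the squared 2-norm of r. */
static double residual_norm2(const double *b, const double *x, double *r) {
  double s = 0.0;
  for (int i = 0; i < N; i++) {
    double left = (i > 0) ? x[i - 1] : 0.0;
    double right = (i < N - 1) ? x[i + 1] : 0.0;
    r[i] = b[i] - (2.0 * x[i] - left - right);
    s += r[i] * r[i];
  }
  return s;
}

/* Hypothetical solver loop starting from the x passed in (a nonzero initial guess).
   check_first = true:  test ||r|| <= rtol * ||b|| before doing any work.
   check_first = false: always perform one cycle first, then test (the order described above).
   Returns the number of cycles performed. */
static int solve_count_cycles(const double *b, double *x, double rtol, int maxits,
                              bool check_first) {
  double r[N];
  double bnorm2 = 0.0;
  for (int i = 0; i < N; i++) bnorm2 += b[i] * b[i];
  const double tol2 = rtol * rtol * bnorm2; /* compare squared norms */
  double rnorm2 = residual_norm2(b, x, r);
  if (check_first && rnorm2 <= tol2) return 0;
  int cycles = 0;
  while (cycles < maxits) {
    /* stand-in "cycle": damped Jacobi, x += (2/3) * D^{-1} r with D = 2I */
    for (int i = 0; i < N; i++) x[i] += (1.0 / 3.0) * r[i];
    cycles++;
    rnorm2 = residual_norm2(b, x, r);
    if (rnorm2 <= tol2) break;
  }
  return cycles;
}
```

When timing repeated solves, the practical consequence is the same either way: unless the solution vector is reset between solves, later solves start at a converged guess and the measured work differs from the first solve.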

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Barry Smith
> On Feb 22, 2023, at 2:19 PM, Paul Grosse-Bley > wrote: > > Hi again, > > after checking with -ksp_monitor for PCMG, it seems my assumption that I > could reset the solution by calling KSPSetComputeInitialGuess and then > KSPSetUp was generally wrong and BoomerAMG was just the only

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley
Hi again, after checking with -ksp_monitor for PCMG, it seems my assumption that I could reset the solution by calling KSPSetComputeInitialGuess and then KSPSetUp was generally wrong, and BoomerAMG was just the only preconditioner that cleverly stops doing work when the residual is already

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Barry Smith
> On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley > wrote: > > Hi Mark, > > I use Nvidia Nsight Systems with --trace > cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers > that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE > calls (in

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Mark Adams
OK, Nsight Systems is a good way to see what is going on. So all three of your solvers are not traversing the MG hierarchy with the correct logic. I don't know about hypre, but PCMG and AMGx are pretty simple, and AMGx dives into the AMGx library directly from our interface. Some things to try: *
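
For reference, the textbook way one V-cycle traverses the hierarchy is: pre-smooth, restrict the residual, recurse on the coarser level, interpolate the correction back, post-smooth. The C sketch below does this for the 1D Poisson problem -u'' = f on (0, 1) with n = 2^k - 1 interior points. The specific choices (weighted Jacobi with omega = 2/3, two pre- and two post-smoothing sweeps, full-weighting restriction, linear interpolation, an exact solve on a single coarsest point) are assumptions for illustration and are not claimed to be what PCMG, hypre or AMGx use by default.

```c
#include <stdlib.h>

/* r = f - A u, where A = (1/h^2) tridiag(-1, 2, -1), h = 1/(n+1), u = 0 on the boundary. */
static void residual(int n, const double *u, const double *f, double *r) {
  const double inv_h2 = (double)(n + 1) * (double)(n + 1);
  for (int i = 0; i < n; i++) {
    double left = (i > 0) ? u[i - 1] : 0.0;
    double right = (i < n - 1) ? u[i + 1] : 0.0;
    r[i] = f[i] - inv_h2 * (2.0 * u[i] - left - right);
  }
}

/* Weighted Jacobi sweeps: u += omega * D^{-1} (f - A u), with D^{-1} = h^2 / 2. */
static void smooth(int n, double *u, const double *f, double *r, int sweeps) {
  const double omega = 2.0 / 3.0;
  const double h2 = 1.0 / ((double)(n + 1) * (double)(n + 1));
  for (int s = 0; s < sweeps; s++) {
    residual(n, u, f, r);
    for (int i = 0; i < n; i++) u[i] += omega * 0.5 * h2 * r[i];
  }
}

/* One V(2,2)-cycle on n = 2^k - 1 interior points, improving u in place. */
static void vcycle(int n, double *u, const double *f) {
  if (n == 1) { /* coarsest level: one unknown, solve (2/h^2) u = f exactly with h = 1/2 */
    u[0] = f[0] / 8.0;
    return;
  }
  int nc = (n - 1) / 2; /* coarse point j sits at fine point 2j+1 */
  double *r = malloc((size_t)n * sizeof *r);
  double *rc = malloc((size_t)nc * sizeof *rc);
  double *ec = calloc((size_t)nc, sizeof *ec); /* coarse correction, zero initial guess */
  if (!r || !rc || !ec) abort();

  smooth(n, u, f, r, 2);                 /* pre-smoothing */
  residual(n, u, f, r);
  for (int j = 0; j < nc; j++)           /* full-weighting restriction */
    rc[j] = 0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2];
  vcycle(nc, ec, rc);                    /* recurse: approximately solve A_c e_c = r_c */
  for (int j = 0; j < nc; j++) {         /* linear interpolation of the correction */
    u[2 * j] += 0.5 * ec[j];
    u[2 * j + 1] += ec[j];
    u[2 * j + 2] += 0.5 * ec[j];
  }
  smooth(n, u, f, r, 2);                 /* post-smoothing */

  free(r);
  free(rc);
  free(ec);
}
```

Each call to vcycle(n, u, f) performs one cycle; calling it repeatedly on the same u, e.g. ten times with n = 63, reduces the residual by a roughly constant factor per cycle. The per-level sequence of smoother, restriction, coarse solve and interpolation is the kind of pattern one would look for in a profiler timeline.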

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley
Hi Mark, I use Nvidia Nsight Systems with --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE calls (in addition to the actual kernels which are not always easy to attribute). For

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Mark Adams
On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley < paul.grosse-b...@ziti.uni-heidelberg.de> wrote: > Hi Barry, > > after using VecCUDAGetArray to initialize the RHS, that kernel still gets > called as part of KSPSolve instead of KSPSetUp, but its runtime is way less > significant than the

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Mark Adams
On Tue, Feb 7, 2023 at 6:40 AM Matthew Knepley wrote: > On Tue, Feb 7, 2023 at 6:23 AM Mark Adams wrote: > >> I do one complete solve to get everything setup, to be safe. >> >> src/ts/tutorials/ex13.c does this and runs multiple solves, if you like >> but one solve is probably fine. >> > > I

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley
Hi Barry, after using VecCUDAGetArray to initialize the RHS, that kernel still gets called as part of KSPSolve instead of KSPSetUp, but its runtime is much less significant than the cudaMemcpy before, so I guess I will leave it like this. Other than that, I kept the code as in my first

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-07 Thread Matthew Knepley
On Tue, Feb 7, 2023 at 6:23 AM Mark Adams wrote: > I do one complete solve to get everything setup, to be safe. > > src/ts/tutorials/ex13.c does this and runs multiple solves, if you like > but one solve is probably fine. > I think that is SNES ex13 Matt > This was designed as a benchmark

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-07 Thread Mark Adams
I do one complete solve to get everything set up, to be safe. src/ts/tutorials/ex13.c does this and runs multiple solves if you like, but one solve is probably fine. This was designed as a benchmark and is nice because it can do an FE solve of Poisson of any order (uses DM/PetscFE, slow).

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-06 Thread Barry Smith
It should not crash; take a look at the test cases at the bottom of the file. You are likely correct: if the code, unfortunately, does use DMCreateMatrix(), it will not work out of the box for geometric multigrid. So it might be the wrong example for you. I don't know what you mean about

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-06 Thread Paul Grosse-Bley
Hi Barry, src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, thank you! Sadly, I get a segfault when executing that example with PCMG and more than one level, i.e. with the minimal args:
$ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-06 Thread Barry Smith
Paul, I think src/ksp/ksp/tutorials/bench_kspsolve.c is the code intended to be used for simple benchmarking. You can use VecCUDAGetArray() to access the GPU memory of the vector and then call a CUDA kernel to compute the right-hand side vector directly on the GPU. Barry > On

[petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-06 Thread Paul Grosse-Bley
Hi, I want to compare different implementations of multigrid solvers for Nvidia GPUs using the Poisson problem (starting from the KSP tutorial example ex45.c). Therefore, I am trying to get runtime results comparable to hpgmg-cuda (finite-volume), i.e. using multiple warmup and measurement solves
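
The warm-up/measurement scheme described here can be sketched as a small C harness. Everything in it is an assumption for illustration: the bench_solver callback interface and the demo payload (damped Jacobi sweeps on a 1D Poisson problem) are hypothetical, timing uses POSIX clock_gettime(CLOCK_MONOTONIC), and the initial guess is reset before every solve so that no timed solve starts from an already converged solution. In a PETSc code the reset would correspond to something like VecSet(x, 0.0) and the solve to KSPSolve(ksp, b, x); since GPU work is launched asynchronously, the solve callback should synchronize the device (for example with cudaDeviceSynchronize()) before returning.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical callback interface for the solver under test. */
typedef struct {
  void (*reset)(void *ctx); /* restore the initial guess, e.g. zero the solution vector */
  void (*solve)(void *ctx); /* one full solve; on a GPU it should synchronize before returning */
  void *ctx;
} bench_solver;

static double now_seconds(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* n_warmup untimed solves, then n_measure timed solves; returns the mean seconds per solve.
   The reset before every solve is outside the timed region. */
static double bench_mean_seconds(const bench_solver *s, int n_warmup, int n_measure) {
  assert(n_measure > 0);
  for (int w = 0; w < n_warmup; w++) {
    s->reset(s->ctx);
    s->solve(s->ctx);
  }
  double total = 0.0;
  for (int k = 0; k < n_measure; k++) {
    s->reset(s->ctx);
    double t0 = now_seconds();
    s->solve(s->ctx);
    total += now_seconds() - t0;
  }
  return total / n_measure;
}

/* Demo payload (hypothetical): fixed number of damped Jacobi sweeps on tridiag(-1, 2, -1). */
#define DEMO_N 64
typedef struct {
  double x[DEMO_N], b[DEMO_N];
  int n_resets, n_solves;
} demo_problem;

static void demo_reset(void *ctx) {
  demo_problem *p = ctx;
  for (int i = 0; i < DEMO_N; i++) p->x[i] = 0.0;
  p->n_resets++;
}

static void demo_solve(void *ctx) {
  demo_problem *p = ctx;
  double xn[DEMO_N];
  for (int s = 0; s < 200; s++) {
    for (int i = 0; i < DEMO_N; i++) {
      double left = (i > 0) ? p->x[i - 1] : 0.0;
      double right = (i < DEMO_N - 1) ? p->x[i + 1] : 0.0;
      /* x += (2/3) * D^{-1} (b - A x) with D = 2I */
      xn[i] = p->x[i] + (1.0 / 3.0) * (p->b[i] - (2.0 * p->x[i] - left - right));
    }
    for (int i = 0; i < DEMO_N; i++) p->x[i] = xn[i];
  }
  p->n_solves++;
}
```

Resetting inside the harness, rather than relying on the solver, keeps every timed solve doing the same amount of work; reporting the mean is one choice, and the minimum over the measured solves is a common alternative.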