On Wed, Feb 22, 2023 at 5:45 PM Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
> I thought I had observed that the number of cycles with -pc_mg_multiplicative_cycles depends on rtol. But I might have seen this with maxits=0, which would explain my misunderstanding of richardson.
>
> I guess PCAMGX does not use this PCApplyRichardson_MG (yet?). Because I still see the multiple PCApplys there with maxits=1 and richardson, while they are gone for PCMG (due to getting rid of KSPSetComputeInitialGuess?).

Yes, that is true. AMGX is a black box, so we cannot look inside the V-cycle.

  Thanks,

    Matt

> Best,
> Paul Grosse-Bley
>
> On Wednesday, February 22, 2023 23:20 CET, Barry Smith <bsm...@petsc.dev> wrote:
>
> Preonly means exactly one application of the PC, so it will never converge by itself unless the PC is a full solver.
>
> Note there is a PCApplyRichardson_MG() that gets used automatically with KSPRICHARDSON. This does not have an "extra" application of the preconditioner, so 2 iterations of Richardson with MG will use 2 applications of the V-cycle. So it is exactly "multigrid as a solver, without a Krylov method", no extra work. So I don't think you need to make any "compromises".
>
>   Barry
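For illustration, a minimal sketch of the setup Barry describes (Richardson as the outer KSP, PCMG as the PC, so each KSP iteration is exactly one V-cycle and the iteration runs until the requested rtol); the variable names are hypothetical and the matrix A and vectors b, x are assumed to exist already:

  KSP ksp;
  PC  pc;
  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetType(ksp, KSPRICHARDSON)); /* PCMG is then applied through PCApplyRichardson_MG() */
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCMG));
  /* iterate V-cycles until rtol (or one of the other tolerances) is reached */
  PetscCall(KSPSetTolerances(ksp, 1e-8, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT));
  PetscCall(KSPSetFromOptions(ksp));
  PetscCall(KSPSolve(ksp, b, x));

The same configuration can be selected from the command line with -ksp_type richardson -pc_type mg -ksp_rtol 1e-8.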
> On Feb 22, 2023, at 4:57 PM, Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
> Hi again,
>
> I now found out that
>
> 1. preonly ignores -ksp_pc_side right (makes sense, I guess).
> 2. richardson is incompatible with -ksp_pc_side right.
> 3. preonly gives less output for -log_view -pc_mg_log than richardson.
> 4. preonly also ignores -ksp_rtol etc.
> 5. preonly causes -log_view to measure incorrect timings for custom stages, i.e. the time for the stage (219us) is significantly shorter than the time for the KSPSolve inside the stage (~40ms).
>
> Number 4 will be problematic, as I want to benchmark the number of V-cycles and the runtime for a given rtol. At the same time I want to avoid richardson now because of number 2 and the additional work of scaling the RHS.
>
> Is there any good way of just using MG V-cycles as a solver, i.e. without interference from an outer Krylov solver, and still iterate until convergence?
> Or will I just have to accept the additional V-cycle due to the left application of the PC with richardson?
>
> I guess I could also manually change -pc_mg_multiplicative_cycles until the residual gets low enough (using preonly), but that seems very inefficient.
>
> Best,
> Paul Große-Bley
>
> On Wednesday, February 22, 2023 21:26 CET, "Paul Grosse-Bley" <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
> I was using the Richardson KSP type, which I guess has the same behavior as GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use preonly from now on, where maxits=1 does what I want it to do.
>
> Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I think all my problems are resolved for now. Thank you very much, Barry and Mark!
>
> Best,
> Paul Große-Bley
>
> On Wednesday, February 22, 2023 21:03 CET, Barry Smith <bsm...@petsc.dev> wrote:
>
> On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
> Hi Barry,
>
> I think most of my "weird" observations came from the fact that I looked at iterations of KSPSolve where the residual was already converged. PCMG and PCGAMG do one V-cycle before even taking a look at the residual and then, independent of pc_mg_multiplicative_cycles, stop if it is converged.
>
> Looking at iterations that are not converged with PCMG, pc_mg_multiplicative_cycles works fine.
>
> At these iterations I also see the multiple calls to PCApply in a single KSPSolve iteration which were throwing me off with PCAMGX before.
>
> The reason for these multiple applications of the preconditioner (tested for both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be better documented, I think.
>
> I do not understand what you are talking about with regard to maxits of 1 instead of 0. For KSP, maxits of 1 means one iteration; 0 is kind of meaningless.
>
> The reason that there is a PCApply at the start of the solve is that by default the KSPType is KSPGMRES, which by default uses left preconditioning, which means the right-hand side needs to be scaled by the preconditioner before the KSP process starts. So in this configuration one KSP iteration results in 2 PCApply. You can use -ksp_pc_side right to use right preconditioning, and then the number of PCApply will match the number of KSP iterations.
>
> Best,
> Paul Große-Bley
>
> On Wednesday, February 22, 2023 20:15 CET, Barry Smith <bsm...@petsc.dev> wrote:
>
> On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
> Hi Mark,
>
> I use Nvidia Nsight Systems with --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE calls (in addition to the actual kernels, which are not always easy to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed NVTX markers.
>
> The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because kernel runtimes on coarser levels are much shorter. At the coarsest level, there normally isn't even enough work for the GPU (Nvidia A100) to be fully occupied, which is also visible in Nsight Systems.
>
> Hmm, I run an example with -pc_mg_multiplicative_cycles 2 and most definitely it changes the run. I am not understanding why it would not work for you. If you use and don't use the option, are the exact same counts listed for all the events in the -log_view?
>
> I run only a single MPI rank with a single GPU, so profiling is straightforward.
>
> Best,
> Paul Große-Bley
>
> On Wednesday, February 22, 2023 18:24 CET, Mark Adams <mfad...@lbl.gov> wrote:
>
> On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>
>> Hi Barry,
>>
>> after using VecCUDAGetArray to initialize the RHS, that kernel still gets called as part of KSPSolve instead of KSPSetUp, but its runtime is way less significant than the cudaMemcpy before, so I guess I will leave it like this. Other than that, I kept the code like in my first message in this thread (as you wrote, benchmark_ksp.c is not well suited for PCMG).
>>
>> The profiling results for PCMG and PCAMG look as I would expect them to look, i.e. one can nicely see the GPU load/kernel runtimes going down and up again for each V-cycle.
>>
>> I was wondering about -pc_mg_multiplicative_cycles, as it does not seem to make any difference. I would have expected to be able to increase the number of V-cycles per KSP iteration, but I keep seeing just a single V-cycle when changing the option (using PCMG).
>
> How are you seeing this?
> You might try -log_trace to see if you get two V cycles.
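One way to answer the count question Barry raises above, besides reading the -log_view table, is to query the logging data for the PCApply event directly after a solve. A sketch, assuming logging is active (e.g. PetscLogDefaultBegin() was called or -log_view is given) and that the solve ran in the default main stage (stage 0):

  PetscLogEvent      pcapply_event;
  PetscEventPerfInfo pcapply_info;
  PetscCall(PetscLogEventGetId("PCApply", &pcapply_event));
  PetscCall(PetscLogEventGetPerfInfo(0, pcapply_event, &pcapply_info)); /* stage 0 = "Main Stage" */
  PetscCall(PetscPrintf(PETSC_COMM_WORLD, "PCApply count: %d\n", pcapply_info.count));

Running this once with and once without -pc_mg_multiplicative_cycles 2 should show whether the option changes the number of V-cycles actually performed.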
>> When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between bench iterations to reset the solution vector does not seem to work, as the residual keeps shrinking. Is this a bug? Any advice for working around this?

> Looking at the doc https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you use this with KSPSetComputeRHS.
>
> In src/snes/tests/ex13.c I just zero out the solution vector.
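A sketch of that kind of reset applied to the benchmarking loop quoted at the bottom of this thread, i.e. zeroing the solution directly instead of re-registering KSPSetComputeInitialGuess; it assumes the vector returned by KSPGetSolution() is the one that KSPSolve(ksp, NULL, NULL) solves into:

  Vec x;
  PetscCall(KSPGetSolution(ksp, &x));
  PetscLogStageRegister("Solve Bench", &solve_bench_stage);
  for (int i = 0; i < BENCH_SOLVES; i++) {
    PetscCall(VecZeroEntries(x));                   /* reset x so every solve starts from the same guess */
    PetscCall(KSPSetUp(ksp));                       /* keep setup work out of the timed stage */
    PetscCall(PetscDeviceContextSynchronize(dctx)); /* make sure all previous GPU work has finished */

    PetscLogStagePush(solve_bench_stage);
    PetscCall(KSPSolve(ksp, NULL, NULL));
    PetscLogStagePop();
  }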
>> The profile for BoomerAMG also doesn't really show the V-cycle behavior of the other implementations. Most of the runtime seems to go into calls to cusparseDcsrsv, which might happen at the different levels, but the runtime of these kernels doesn't show the V-cycle pattern. According to the output with -pc_hypre_boomeramg_print_statistics it is doing the right thing though, so I guess it is alright (and if not, this is probably the wrong place to discuss it).

>> When using PCAMGX, I see two PCApply calls (each showing a nice V-cycle behavior) in KSPSolve (three for the very first KSPSolve) while expecting just one. Each KSPSolve should do a single preconditioned Richardson iteration. Why is the preconditioner applied multiple times here?

> Again, not sure what "see" is, but PCAMGX is pretty new and has not been used much.
> Note some KSP methods apply the PC before the iteration.
>
> Mark

>> Thank you,
>> Paul Große-Bley
>>
>> On Monday, February 06, 2023 20:05 CET, Barry Smith <bsm...@petsc.dev> wrote:
>>
>> It should not crash; take a look at the test cases at the bottom of the file. You are likely correct: if the code, unfortunately, does use DMCreateMatrix(), it will not work out of the box for geometric multigrid. So it might be the wrong example for you.
>>
>> I don't know what you mean about clever. If you simply set the solution to zero at the beginning of the loop, then it will just do the same solve multiple times. The setup should not do much of anything after the first solve. Though usually solves are big enough that one need not run them multiple times to get a good understanding of their performance.
>>
>> On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>>
>> Hi Barry,
>>
>> src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, thank you! Sadly, I get a segfault when executing that example with PCMG and more than one level, i.e. with the minimal args:
>>
>> $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
>> ===========================================
>> Test: KSP performance - Poisson
>> Input matrix: 27-pt finite difference stencil
>> -n 100
>> DoFs = 1000000
>> Number of nonzeros = 26463592
>>
>> Step1 - creating Vecs and Mat...
>> Step2a - running PCSetUp()...
>> [0]PETSC ERROR: ------------------------------------------------------------------------
>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>> [0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
>> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
>> [0]PETSC ERROR: to get more information on the crash.
>> [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 59.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>>
>> As the matrix is not created using DMDACreate3d, I expected it to fail due to the missing geometric information, but I expected it to fail more gracefully than with a segfault.
>> I will try to combine bench_kspsolve.c with ex45.c to get easy MG preconditioning, especially since I am interested in the 7pt stencil for now.
>>
>> Concerning my benchmarking loop from before: Is it generally discouraged to do this for KSPSolve due to PETSc cleverly/lazily skipping some of the work when doing the same solve multiple times, or are the solves not iterated in bench_kspsolve.c (while the MatMults are with -matmult) just to keep the runtime short?
>>
>> Thanks,
>> Paul
>>
>> On Monday, February 06, 2023 17:04 CET, Barry Smith <bsm...@petsc.dev> wrote:
>>
>> Paul,
>>
>> I think src/ksp/ksp/tutorials/benchmark_ksp.c is the code intended to be used for simple benchmarking.
>>
>> You can use VecCUDAGetArray() to access the GPU memory of the vector and then call a CUDA kernel to compute the right-hand side vector directly on the GPU.
>>
>>   Barry
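A sketch of that pattern; fill_rhs_on_device() stands in for a hypothetical launcher of a user-written CUDA kernel, and the vector b is assumed to be a CUDA vector (e.g. of type VECCUDA):

  PetscScalar *d_b;
  PetscInt     n;
  PetscCall(VecGetLocalSize(b, &n));
  PetscCall(VecCUDAGetArray(b, &d_b));     /* raw device pointer, no copy back to the host */
  fill_rhs_on_device(d_b, n);              /* hypothetical CUDA kernel launch filling the RHS on the GPU */
  PetscCall(VecCUDARestoreArray(b, &d_b)); /* tell PETSc the device values are current */

If the previous values of b are not needed, VecCUDAGetArrayWrite()/VecCUDARestoreArrayWrite() can be used instead, which skips any copy of the old values to the device.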
>> On Feb 6, 2023, at 10:57 AM, Paul Grosse-Bley <paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>>
>> Hi,
>>
>> I want to compare different implementations of multigrid solvers for Nvidia GPUs using the Poisson problem (starting from KSP tutorial example ex45.c).
>> Therefore I am trying to get runtime results comparable to hpgmg-cuda <https://bitbucket.org/nsakharnykh/hpgmg-cuda/src/master/> (finite-volume), i.e. using multiple warmup and measurement solves and avoiding measuring setup time.
>> For now I am using -log_view with added stages:
>>
>> PetscLogStageRegister("Solve Bench", &solve_bench_stage);
>> for (int i = 0; i < BENCH_SOLVES; i++) {
>>   PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x
>>   PetscCall(KSPSetUp(ksp)); // try to avoid setup overhead during solve
>>   PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that everything is done
>>
>>   PetscLogStagePush(solve_bench_stage);
>>   PetscCall(KSPSolve(ksp, NULL, NULL));
>>   PetscLogStagePop();
>> }
>>
>> This snippet is preceded by a similar loop for warmup.
>>
>> When profiling this using Nsight Systems, I see that the very first solve is much slower, which mostly corresponds to H2D (host to device) copies and e.g. cuBLAS setup (maybe also paging overheads as mentioned in the docs <https://petsc.org/release/docs/manual/profiling/#accurate-profiling-and-paging-overheads>, but probably insignificant in this case). The following solves have some overhead at the start from a H2D copy of a vector (the RHS, I guess, as the copy is preceded by a matrix-vector product) in the first MatResidual call (call chain: KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My interpretation of the profiling results (i.e. the cuBLAS calls) is that that vector is overwritten with the residual in Daxpy and therefore has to be copied again for the next iteration.
>>
>> Is there an elegant way of avoiding that H2D copy? I have seen some examples on constructing matrices directly on the GPU, but nothing about vectors. Any further tips for benchmarking (vs profiling) PETSc solvers? At the moment I am using jacobi as the smoother, but I would like to have a CUDA implementation of SOR instead. Is there a good way of achieving that, e.g. using PCHYPRE's BoomerAMG with a single level and "SOR/Jacobi" as the smoother in PCMG? Or is the overhead from constantly switching between PETSc and hypre too big?
>>
>> Thanks,
>> Paul

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/