On Tue, Feb 7, 2023 at 6:40 AM Matthew Knepley <knep...@gmail.com> wrote:
> On Tue, Feb 7, 2023 at 6:23 AM Mark Adams <mfad...@lbl.gov> wrote:
>
>> I do one complete solve to get everything set up, to be safe.
>>
>> src/ts/tutorials/ex13.c does this and runs multiple solves, if you like,
>> but one solve is probably fine.
>>
>
> I think that is SNES ex13
>

Yes, it is src/snes/tests/ex13.c

> Matt
>
>> This was designed as a benchmark and is nice because it can do any order
>> FE solve of Poisson (uses DM/PetscFE, slow).
>> src/ksp/ksp/tutorials/ex56.c is old school, hardwired for elasticity, but
>> it is simpler and the setup is faster if you are doing large problems per
>> MPI process.
>>
>> Mark
>>
>> On Mon, Feb 6, 2023 at 2:06 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>>    It should not crash; take a look at the test cases at the bottom of
>>> the file. You are likely correct: since the code, unfortunately, does use
>>> DMCreateMatrix(), it will not work out of the box for geometric
>>> multigrid, so it might be the wrong example for you.
>>>
>>>    I don't know what you mean about clever. If you simply set the
>>> solution to zero at the beginning of the loop, then it will just do the
>>> same solve multiple times. The setup should not do much of anything after
>>> the first solve. Though usually solves are big enough that one need not
>>> run them multiple times to get a good understanding of their performance.
>>>
>>> On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley <
>>> paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>>>
>>> Hi Barry,
>>>
>>> src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting
>>> point, thank you! Sadly, I get a segfault when executing that example
>>> with PCMG and more than one level, i.e. with the minimal args:
>>>
>>> $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
>>> ===========================================
>>> Test: KSP performance - Poisson
>>> Input matrix: 27-pt finite difference stencil
>>> -n 100
>>> DoFs = 1000000
>>> Number of nonzeros = 26463592
>>>
>>> Step1  - creating Vecs and Mat...
>>> Step2a - running PCSetUp()...
>>> [0]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>>> probably memory access out of range
>>> [0]PETSC ERROR: Try option -start_in_debugger or
>>> -on_error_attach_debugger
>>> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
>>> https://petsc.org/release/faq/
>>> [0]PETSC ERROR: or try
>>> https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA
>>> systems to find memory corruption errors
>>> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
>>> and run
>>> [0]PETSC ERROR: to get more information on the crash.
>>> [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is
>>> causing the crash.
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>> with errorcode 59.
>>>
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>>
>>> As the matrix is not created using DMDACreate3d, I expected it to fail
>>> due to the missing geometric information, but I expected it to fail more
>>> gracefully than with a segfault.
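For context, a minimal sketch of the DMDA-based setup that ex45.c uses, which is what gives PCMG the geometric hierarchy it needs, could look like the following. The ComputeMatrix/ComputeRHS/ComputeInitialGuess callbacks are assumed to be the ones defined in src/ksp/ksp/tutorials/ex45.c, and the 7x7x7 base grid is only illustrative:

#include <petscdmda.h>
#include <petscksp.h>

/* Callbacks assumed to be the ones from src/ksp/ksp/tutorials/ex45.c */
extern PetscErrorCode ComputeMatrix(KSP, Mat, Mat, void *);
extern PetscErrorCode ComputeRHS(KSP, Vec, void *);
extern PetscErrorCode ComputeInitialGuess(KSP, Vec, void *);

int main(int argc, char **argv)
{
  KSP ksp;
  DM  da;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  /* The DMDA carries the structured-grid information geometric multigrid needs. */
  PetscCall(DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                         DM_BOUNDARY_NONE, DMDA_STENCIL_STAR, 7, 7, 7,
                         PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                         1, 1, NULL, NULL, NULL, &da));
  PetscCall(DMSetFromOptions(da));
  PetscCall(DMSetUp(da));
  PetscCall(KSPSetDM(ksp, da));       /* lets PCMG coarsen/refine the DMDA */
  PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL));
  PetscCall(KSPSetComputeRHS(ksp, ComputeRHS, NULL));
  PetscCall(KSPSetComputeOperators(ksp, ComputeMatrix, NULL));
  PetscCall(KSPSetFromOptions(ksp));  /* e.g. -pc_type mg -pc_mg_levels N */
  PetscCall(KSPSolve(ksp, NULL, NULL));
  PetscCall(DMDestroy(&da));
  PetscCall(KSPDestroy(&ksp));
  PetscCall(PetscFinalize());
  return 0;
}

With this structure the multigrid hierarchy is generated from the DM (e.g. via -da_refine or -pc_mg_levels) rather than from a preassembled matrix, which is exactly what bench_kspsolve.c does not provide.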
>>> I will try to combine bench_kspsolve.c with ex45.c to get easy MG
>>> preconditioning, especially since I am interested in the 7-pt stencil
>>> for now.
>>>
>>> Concerning my benchmarking loop from before: is it generally discouraged
>>> to do this for KSPSolve because PETSc cleverly/lazily skips some of the
>>> work when doing the same solve multiple times, or are the solves not
>>> iterated in bench_kspsolve.c (while the MatMults are, with -matmult) just
>>> to keep the runtime short?
>>>
>>> Thanks,
>>> Paul
>>>
>>> On Monday, February 06, 2023 17:04 CET, Barry Smith <bsm...@petsc.dev>
>>> wrote:
>>>
>>>    Paul,
>>>
>>>    I think src/ksp/ksp/tutorials/benchmark_ksp.c is the code intended to
>>> be used for simple benchmarking.
>>>
>>>    You can use VecCUDAGetArray() to access the GPU memory of the vector
>>> and then call a CUDA kernel to compute the right-hand-side vector
>>> directly on the GPU.
>>>
>>>    Barry
>>>
>>> On Feb 6, 2023, at 10:57 AM, Paul Grosse-Bley <
>>> paul.grosse-b...@ziti.uni-heidelberg.de> wrote:
>>>
>>> Hi,
>>>
>>> I want to compare different implementations of multigrid solvers for
>>> NVIDIA GPUs using the Poisson problem (starting from KSP tutorial example
>>> ex45.c). Therefore I am trying to get runtime results comparable to
>>> hpgmg-cuda <https://bitbucket.org/nsakharnykh/hpgmg-cuda/src/master/>
>>> (finite-volume), i.e. using multiple warmup and measurement solves and
>>> avoiding measuring setup time. For now I am using -log_view with added
>>> stages:
>>>
>>> PetscLogStageRegister("Solve Bench", &solve_bench_stage);
>>> for (int i = 0; i < BENCH_SOLVES; i++) {
>>>   PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x
>>>   PetscCall(KSPSetUp(ksp)); // try to avoid setup overhead during solve
>>>   PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that everything is done
>>>
>>>   PetscLogStagePush(solve_bench_stage);
>>>   PetscCall(KSPSolve(ksp, NULL, NULL));
>>>   PetscLogStagePop();
>>> }
>>>
>>> This snippet is preceded by a similar loop for warmup.
>>>
>>> When profiling this using Nsight Systems, I see that the very first
>>> solve is much slower, which mostly corresponds to H2D (host-to-device)
>>> copies and e.g. cuBLAS setup (maybe also paging overheads as mentioned in
>>> the docs
>>> <https://petsc.org/release/docs/manual/profiling/#accurate-profiling-and-paging-overheads>,
>>> but probably insignificant in this case). The following solves have some
>>> overhead at the start from an H2D copy of a vector (the RHS I guess, as
>>> the copy is preceded by a matrix-vector product) in the first MatResidual
>>> call (call chain:
>>> KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My
>>> interpretation of the profiling results (i.e. the cuBLAS calls) is that
>>> this vector is overwritten with the residual in Daxpy and therefore has
>>> to be copied again for the next iteration.
>>>
>>> Is there an elegant way of avoiding that H2D copy? I have seen some
>>> examples on constructing matrices directly on the GPU, but nothing about
>>> vectors. Any further tips for benchmarking (vs. profiling) PETSc solvers?
>>> At the moment I am using Jacobi as the smoother, but I would like to have
>>> a CUDA implementation of SOR instead. Is there a good way of achieving
>>> that, e.g. using PCHYPRE's BoomerAMG with a single level and a
>>> "SOR/Jacobi" smoother as the smoother in PCMG? Or is the overhead from
>>> constantly switching between PETSc and hypre too big?
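As a concrete illustration of the VecCUDAGetArray() suggestion above and of filling the right-hand side without any host-to-device copy, here is a rough sketch. It assumes a CUDA-enabled, real-scalar PETSc build, a vector of type VECCUDA (e.g. created with -dm_vec_type cuda), a .cu translation unit compiled by nvcc, and a placeholder kernel FillRHSKernel standing in for the real forcing term; the function names are as declared in recent PETSc releases:

#include <petscksp.h>
#include <petscdevice_cuda.h> /* VecCUDAGetArrayWrite(), PetscCallCUDA() in recent PETSc */

/* Placeholder kernel: fills the right-hand side directly in device memory.
   A real implementation would evaluate the actual forcing term here. */
__global__ void FillRHSKernel(PetscScalar *b, PetscInt n)
{
  PetscInt i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) b[i] = 1.0;
}

static PetscErrorCode FillRHSOnDevice(Vec b)
{
  PetscScalar *barr;
  PetscInt     n;

  PetscFunctionBeginUser;
  PetscCall(VecGetLocalSize(b, &n));
  PetscCall(VecCUDAGetArrayWrite(b, &barr)); /* device pointer; no H2D copy is triggered */
  FillRHSKernel<<<(n + 255) / 256, 256>>>(barr, n);
  PetscCallCUDA(cudaGetLastError());
  PetscCall(VecCUDARestoreArrayWrite(b, &barr));
  PetscFunctionReturn(0);
}

Because VecCUDAGetArrayWrite() marks the device copy as current, PETSc should not schedule a host-to-device transfer for this vector afterwards.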
>>>
>>> Thanks,
>>> Paul
>>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
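For reference, one plausible way to run an ex45-style Poisson benchmark on a GPU with geometric multigrid and Chebyshev/Jacobi smoothing, tying the pieces above together, might look like this (grid size, level count, and option values are only illustrative, not taken from the thread):

$ mpiexec -n 1 ./ex45 -da_grid_x 129 -da_grid_y 129 -da_grid_z 129 \
    -dm_mat_type aijcusparse -dm_vec_type cuda \
    -ksp_type cg -pc_type mg -pc_mg_levels 5 \
    -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi \
    -ksp_monitor -log_view

Here -dm_mat_type aijcusparse and -dm_vec_type cuda make the DM create GPU-resident matrices and vectors, so the warmup and measurement solves stay on the device.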