Junchao,

Thank you for working on this. If you open the parameter file for, say, the PSI_2 system test case (benchmarks/params_system_PSI_2.txt), simply add -dm_mat_type aijkokkos -dm_vec_type kokkos to the "petscArgs=" field (or the corresponding cusparse/cuda options).
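For illustration, the edited line might look something like the sketch below. This is a sketch of the layout only: "<existing options>" stands for whatever arguments are already on the "petscArgs=" line of params_system_PSI_2.txt, which are not reproduced here.

    petscArgs=<existing options> -dm_mat_type aijkokkos -dm_vec_type kokkos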
Thanks,

Philip Fackler
Research Software Engineer, Application Engineering Group
Advanced Computing Systems Research Section
Computer Science and Mathematics Division
Oak Ridge National Laboratory

________________________________
From: Junchao Zhang <junchao.zh...@gmail.com>
Sent: Thursday, December 1, 2022 17:05
To: Fackler, Philip <fackle...@ornl.gov>
Cc: xolotl-psi-developm...@lists.sourceforge.net; petsc-users@mcs.anl.gov; Blondel, Sophie <sblon...@utk.edu>; Roth, Philip <rot...@ornl.gov>
Subject: Re: [EXTERNAL] Re: [petsc-users] Kokkos backend for Mat and Vec diverging when running on CUDA device.

Hi, Philip,

Sorry for the long delay. I could not get anything useful from the -log_view output. Since I have already built xolotl, could you give me instructions on how to run a xolotl test that reproduces the divergence with the petsc GPU backends (but runs fine on CPU)?

Thank you.

--Junchao Zhang

On Wed, Nov 16, 2022 at 1:38 PM Fackler, Philip <fackle...@ornl.gov> wrote:

------------------------------------------------------------------ PETSc Performance Summary: ------------------------------------------------------------------

Unknown Name on a named PC0115427 with 1 processor, by 4pf Wed Nov 16 14:36:46 2022
Using Petsc Development GIT revision: v3.18.1-115-gdca010e0e9a  GIT Date: 2022-10-28 14:39:41 +0000

                         Max       Max/Min     Avg       Total
Time (sec):           6.023e+00     1.000   6.023e+00
Objects:              1.020e+02     1.000   1.020e+02
Flops:                1.080e+09     1.000   1.080e+09  1.080e+09
Flops/sec:            1.793e+08     1.000   1.793e+08  1.793e+08
MPI Msg Count:        0.000e+00     0.000   0.000e+00  0.000e+00
MPI Msg Len (bytes):  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00     0.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
 0:      Main Stage: 6.0226e+00 100.0%  1.0799e+09 100.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase          %F - percent flop in this phase
      %M - percent messages in this phase      %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event              Count      Time (sec)  Flop                                 --- Global ---   --- Stage ----   Total   GPU    - CpuToGpu -  - GpuToCpu -  GPU
                   Max Ratio  Max   Ratio Max      Ratio Mess    AvgLen  Reduct %T %F %M %L %R  %T %F %M %L %R  Mflop/s Mflop/s Count Size    Count Size    %F
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided          3 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
DMCreateMat            1 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
SFSetGraph             3 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
SFSetUp                3 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
SFPack              4647 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
SFUnpack            4647 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
VecDot               190 1.0 nan nan 2.11e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
VecMDot              775 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
VecNorm             1728 1.0 nan nan 1.92e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
VecScale            1983 1.0 nan nan 6.24e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
VecCopy              780 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
VecSet              4955 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
VecAXPY              190 1.0 nan nan 2.11e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
VecAYPX              597 1.0 nan nan 6.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
VecAXPBYCZ           643 1.0 nan nan 1.79e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
VecWAXPY             502 1.0 nan nan 5.58e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
VecMAXPY            1159 1.0 nan nan 3.68e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
VecScatterBegin     4647 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  -nan -nan 2 5.14e-03 0 0.00e+00   0
VecScatterEnd       4647 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
VecReduceArith       380 1.0 nan nan 4.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
VecReduceComm        190 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
VecNormalize         965 1.0 nan nan 1.61e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
TSStep                20 1.0 5.8699e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 0.0e+00 97100  0  0  0  97100  0  0  0  184 -nan 2 5.14e-03 0 0.00e+00  54
TSFunctionEval       597 1.0 nan nan 6.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00 63  1  0  0  0  63  1  0  0  0  -nan -nan 1 3.36e-04 0 0.00e+00 100
TSJacobianEval       190 1.0 nan nan 3.37e+07 1.0 0.0e+00 0.0e+00 0.0e+00 24  3  0  0  0  24  3  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00  97
MatMult             1930 1.0 nan nan 4.46e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1 41  0  0  0   1 41  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
MatMultTranspose       1 1.0 nan nan 3.44e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
MatSolve             965 1.0 nan nan 5.04e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  5  0  0  0   1  5  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
MatSOR               965 1.0 nan nan 3.33e+08 1.0 0.0e+00 0.0e+00 0.0e+00  4 31  0  0  0   4 31  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
MatLUFactorSym         1 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
MatLUFactorNum       190 1.0 nan nan 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   1 11  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
MatScale             190 1.0 nan nan 3.26e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
MatAssemblyBegin     761 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
MatAssemblyEnd       761 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
MatGetRowIJ            1 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
MatCreateSubMats     380 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
MatGetOrdering         1 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
MatZeroEntries       379 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
MatSetPreallCOO        1 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
MatSetValuesCOO      190 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
KSPSetUp             760 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
KSPSolve             190 1.0 5.8052e-01 1.0 9.30e+08 1.0 0.0e+00 0.0e+00 0.0e+00 10 86  0  0  0  10 86  0  0  0  1602 -nan 1 4.80e-03 0 0.00e+00  46
KSPGMRESOrthog       775 1.0 nan nan 2.27e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
SNESSolve             71 1.0 5.7117e+00 1.0 1.07e+09 1.0 0.0e+00 0.0e+00 0.0e+00 95 99  0  0  0  95 99  0  0  0  188 -nan 1 4.80e-03 0 0.00e+00  53
SNESSetUp              1 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
SNESFunctionEval     573 1.0 nan nan 2.23e+07 1.0 0.0e+00 0.0e+00 0.0e+00 60  2  0  0  0  60  2  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
SNESJacobianEval     190 1.0 nan nan 3.37e+07 1.0 0.0e+00 0.0e+00 0.0e+00 24  3  0  0  0  24  3  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00  97
SNESLineSearch       190 1.0 nan nan 1.05e+08 1.0 0.0e+00 0.0e+00 0.0e+00 53 10  0  0  0  53 10  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00 100
PCSetUp              570 1.0 nan nan 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2 11  0  0  0   2 11  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
PCApply              965 1.0 nan nan 6.14e+08 1.0 0.0e+00 0.0e+00 0.0e+00  8 57  0  0  0   8 57  0  0  0  -nan -nan 1 4.80e-03 0 0.00e+00  19
KSPSolve_FS_0        965 1.0 nan nan 3.33e+08 1.0 0.0e+00 0.0e+00 0.0e+00  4 31  0  0  0   4 31  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0
KSPSolve_FS_1        965 1.0 nan nan 1.66e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2 15  0  0  0   2 15  0  0  0  -nan -nan 0 0.00e+00 0 0.00e+00   0

--- Event Stage 1: Unknown

------------------------------------------------------------------------------------------------------------------------

Object Type          Creations   Destructions. Reports information only for process 0.

--- Event Stage 0: Main Stage

           Container     5              5
    Distributed Mesh     2              2
           Index Set    11             11
   IS L to G Mapping     1              1
   Star Forest Graph     7              7
     Discrete System     2              2
           Weak Form     2              2
              Vector    49             49
             TSAdapt     1              1
                  TS     1              1
                DMTS     1              1
                SNES     1              1
              DMSNES     3              3
      SNESLineSearch     1              1
       Krylov Solver     4              4
     DMKSP interface     1              1
              Matrix     4              4
      Preconditioner     4              4
              Viewer     2              1

--- Event Stage 1: Unknown

========================================================================================================================
Average time to get PetscTime(): 3.14e-08
#PETSc Option Table entries:
-log_view
-log_view_gpu_times
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with 64 bit PetscInt
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 8
Configure options: PETSC_DIR=/home/4pf/repos/petsc PETSC_ARCH=arch-kokkos-cuda-no-tpls --with-cc=mpicc --with-cxx=mpicxx --with-fc=0 --with-cuda --with-debugging=0 --with-shared-libraries --prefix=/home/4pf/build/petsc/cuda-no-tpls/install --with-64-bit-indices --COPTFLAGS=-O3 --CXXOPTFLAGS=-O3 --CUDAOPTFLAGS=-O3 --with-kokkos-dir=/home/4pf/build/kokkos/cuda/install --with-kokkos-kernels-dir=/home/4pf/build/kokkos-kernels/cuda-no-tpls/install
-----------------------------------------
Libraries compiled on 2022-11-01 21:01:08 on PC0115427
Machine characteristics: Linux-5.15.0-52-generic-x86_64-with-glibc2.35
Using PETSc directory: /home/4pf/build/petsc/cuda-no-tpls/install
Using PETSc arch:
-----------------------------------------
Using C compiler: mpicc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O3
-----------------------------------------
Using include paths: -I/home/4pf/build/petsc/cuda-no-tpls/install/include -I/home/4pf/build/kokkos-kernels/cuda-no-tpls/install/include -I/home/4pf/build/kokkos/cuda/install/include -I/usr/local/cuda-11.8/include
-----------------------------------------
Using C linker: mpicc
Using libraries: -Wl,-rpath,/home/4pf/build/petsc/cuda-no-tpls/install/lib -L/home/4pf/build/petsc/cuda-no-tpls/install/lib -lpetsc -Wl,-rpath,/home/4pf/build/kokkos-kernels/cuda-no-tpls/install/lib -L/home/4pf/build/kokkos-kernels/cuda-no-tpls/install/lib -Wl,-rpath,/home/4pf/build/kokkos/cuda/install/lib -L/home/4pf/build/kokkos/cuda/install/lib -Wl,-rpath,/usr/local/cuda-11.8/lib64 -L/usr/local/cuda-11.8/lib64 -L/usr/local/cuda-11.8/lib64/stubs -lkokkoskernels -lkokkoscontainers -lkokkoscore -llapack -lblas -lm -lcudart -lnvToolsExt -lcufft -lcublas -lcusparse -lcusolver -lcurand -lcuda -lquadmath -lstdc++ -ldl
-----------------------------------------

Philip Fackler
Research Software Engineer, Application Engineering Group
Advanced Computing Systems Research Section
Computer Science and Mathematics Division
Oak Ridge National Laboratory
________________________________
From: Junchao Zhang <junchao.zh...@gmail.com>
Sent: Tuesday, November 15, 2022 13:03
To: Fackler, Philip <fackle...@ornl.gov>
Cc: xolotl-psi-developm...@lists.sourceforge.net; petsc-users@mcs.anl.gov; Blondel, Sophie <sblon...@utk.edu>; Roth, Philip <rot...@ornl.gov>
Subject: Re: [EXTERNAL] Re: [petsc-users] Kokkos backend for Mat and Vec diverging when running on CUDA device.

Can you paste the -log_view result so I can see what functions are used?

--Junchao Zhang

On Tue, Nov 15, 2022 at 10:24 AM Fackler, Philip <fackle...@ornl.gov> wrote:

Yes, most (but not all) of our system test cases fail with the kokkos/cuda or cuda backends. All of them pass with the CPU-only kokkos backend.

Philip Fackler
Research Software Engineer, Application Engineering Group
Advanced Computing Systems Research Section
Computer Science and Mathematics Division
Oak Ridge National Laboratory

________________________________
From: Junchao Zhang <junchao.zh...@gmail.com>
Sent: Monday, November 14, 2022 19:34
To: Fackler, Philip <fackle...@ornl.gov>
Cc: xolotl-psi-developm...@lists.sourceforge.net; petsc-users@mcs.anl.gov; Blondel, Sophie <sblon...@utk.edu>; Zhang, Junchao <jczh...@mcs.anl.gov>; Roth, Philip <rot...@ornl.gov>
Subject: [EXTERNAL] Re: [petsc-users] Kokkos backend for Mat and Vec diverging when running on CUDA device.

Hi, Philip,

Sorry to hear that. It seems you can run the same code on CPUs but not on GPUs (with either the petsc/Kokkos backend or the petsc/CUDA backend), is that right?

--Junchao Zhang

On Mon, Nov 14, 2022 at 12:13 PM Fackler, Philip via petsc-users <petsc-users@mcs.anl.gov> wrote:

This is an issue I've brought up before (and discussed in person with Richard). I wanted to bring it up again because I'm hitting the limits of what I know to do, and I need help figuring this out.

The problem can be reproduced using Xolotl's "develop" branch built against a petsc build with kokkos and kokkos-kernels enabled. Then, either add the relevant kokkos options to the "petscArgs=" line in the system test parameter file(s), or just replace the system test parameter files with the ones from the "feature-petsc-kokkos" branch. See, in that branch, the files that begin with "params_system_". Note that those files use the "kokkos" options, but the problem is similar using the corresponding cuda/cusparse options (see the sketch at the end of this message).

I've already tried building kokkos-kernels with no TPLs and got slightly different results, but the same problem.

Any help would be appreciated.

Thanks,

Philip Fackler
Research Software Engineer, Application Engineering Group
Advanced Computing Systems Research Section
Computer Science and Mathematics Division
Oak Ridge National Laboratory
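For reference, a minimal sketch of the two variants of the parameter-file line discussed in this thread. The "<existing options>" placeholder stands for whatever arguments are already on the "petscArgs=" line and is not from the original files; the aijcusparse/cuda values are the standard PETSc type names assumed here to be the "corresponding cuda/cusparse options".

Kokkos backend:
    petscArgs=<existing options> -dm_mat_type aijkokkos -dm_vec_type kokkos

CUDA/cuSPARSE backend:
    petscArgs=<existing options> -dm_mat_type aijcusparse -dm_vec_type cuda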