Mark, I think you can benchmark individual vector operations first; once we get reasonable profiling results, we can move on to solvers, etc.
--Junchao Zhang

On Mon, Jan 24, 2022 at 12:09 PM Mark Adams <mfad...@lbl.gov> wrote:

> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith <bsm...@petsc.dev> wrote:
>
>> Here except for VecNorm the GPU is used effectively in that most of the
>> time is spent doing real work on the GPU
>>
>> VecNorm 402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02 0 1 0 0 20 9 1 0 0 33 30230 225393 0 0.00e+00 0 0.00e+00 100
>>
>> Even the dots are very effective, only the VecNorm flop rate over the
>> full time is much much lower than the vecdot. Which is somehow due to the
>> use of the GPU or CPU MPI in the allreduce?
>
> The VecNorm GPU rate is relatively high on Crusher and the CPU rate is
> about the same as the other vec ops. I don't know what to make of that.
>
> But Crusher is clearly not crushing it.
>
> Junchao: Perhaps we should ask Kokkos if they have any experience with
> Crusher that they can share. They could very well find some low level magic.
>
>> On Jan 24, 2022, at 12:14 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>>> Mark, can we compare with Spock?
>>
>> Looks much better. This puts two processes/GPU because there are only 4.
>> <jac_out_001_kokkos_Spock_6_1_notpl.txt>
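
For anyone reading the quoted VecNorm line, here is a minimal Python sketch that splits it into named fields and compares the two flop-rate columns. The field names in `FIELDS` are hypothetical labels, and the column order is an assumption based on the usual `-log_view` layout for GPU builds (count, time, flop, messages, reductions, global and stage percentages, total and GPU Mflop/s, transfer counts, GPU %F). Please check it against the header in the actual log file.

```python
# Hypothetical field labels; the column order is ASSUMED to match the usual
# PETSc -log_view layout for GPU builds and is not confirmed by the thread.
FIELDS = [
    "count_max", "count_ratio", "time_max", "time_ratio",
    "flop_max", "flop_ratio", "mess", "avg_len", "reduct",
    "global_T", "global_F", "global_M", "global_L", "global_R",
    "stage_T", "stage_F", "stage_M", "stage_L", "stage_R",
    "total_mflops", "gpu_mflops",
    "cpu_to_gpu_count", "cpu_to_gpu_size",
    "gpu_to_cpu_count", "gpu_to_cpu_size", "gpu_pct_F",
]


def parse_event_line(line):
    """Split one event line into (event name, {field label: float value})."""
    tokens = line.split()
    name = tokens[0]
    values = [float(t) for t in tokens[1:]]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    return name, dict(zip(FIELDS, values))


# The VecNorm line exactly as quoted in the thread, joined onto one line.
VECNORM = (
    "VecNorm 402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 "
    "4.0e+02 0 1 0 0 20 9 1 0 0 33 30230 225393 0 0.00e+00 0 0.00e+00 100"
)

name, ev = parse_event_line(VECNORM)

# Ratio of the GPU-only rate to the rate measured over the whole event.
gpu_to_total = ev["gpu_mflops"] / ev["total_mflops"]

print(f"{name}: time max/min ratio across ranks = {ev['time_ratio']}")
print(f"{name}: GPU Mflop/s / total Mflop/s = {gpu_to_total:.2f}")
```

If the column assumption holds, the GPU-only rate is several times the rate over the full event, and the time ratio across ranks is 6.1. That matches Barry's observation that much of the VecNorm time goes somewhere other than the GPU kernel, but it does not by itself say whether the allreduce is responsible.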