Here, except for VecNorm, the GPU is used effectively, in that most of the time is spent doing real work on the GPU.
VecNorm 402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02 0 1 0 0 20 9 1 0 0 33 30230 225393 0 0.00e+00 0 0.00e+00 100

Even the dots are very effective; only the VecNorm flop rate over the full time is much, much lower than the VecDot rate. Is that somehow due to the use of GPU or CPU MPI in the allreduce?

> On Jan 24, 2022, at 12:14 PM, Mark Adams <mfad...@lbl.gov> wrote:
>
>
> > Mark, can we compare with Spock?
>
> Looks much better. This puts two processes/GPU because there are only 4.
> <jac_out_001_kokkos_Spock_6_1_notpl.txt>
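
To make the VecNorm numbers above easier to read, here is a minimal Python sketch that splits that row into named columns and compares the two reported flop rates. The column names in FIELDS are my own labels, and the column order is an assumption based on the usual -log_view layout with GPU columns (Count, Time, Flop, Mess, AvgLen, Reduct, Global %, Stage %, Total Mflop/s, GPU Mflop/s, CpuToGpu, GpuToCpu, GPU %F); check it against the header in the attached output before trusting it.

```python
# Column labels are hypothetical names; the ORDER is assumed to match the
# usual PETSc -log_view event table when GPU columns are present.
FIELDS = [
    "count_max", "count_ratio", "time_max", "time_ratio", "flop_max", "flop_ratio",
    "mess", "avg_len", "reduct",
    "global_T", "global_F", "global_M", "global_L", "global_R",
    "stage_T", "stage_F", "stage_M", "stage_L", "stage_R",
    "total_mflops", "gpu_mflops",
    "cpu_to_gpu_count", "cpu_to_gpu_size", "gpu_to_cpu_count", "gpu_to_cpu_size",
    "gpu_F",
]

# The VecNorm row exactly as quoted in this message.
ROW = ("VecNorm 402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02 "
       "0 1 0 0 20 9 1 0 0 33 30230 225393 0 0.00e+00 0 0.00e+00 100")


def parse_row(row):
    """Split one event row into (event name, {column label: float value})."""
    name, *values = row.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    return name, dict(zip(FIELDS, map(float, values)))


name, ev = parse_row(ROW)

# Ratio of the GPU-only flop rate to the flop rate over the full event time.
rate_gap = ev["gpu_mflops"] / ev["total_mflops"]

print(f"{name}: total {ev['total_mflops']:.0f} Mflop/s, "
      f"GPU {ev['gpu_mflops']:.0f} Mflop/s, gap {rate_gap:.2f}x, "
      f"time max/min ratio {ev['time_ratio']}, reductions {ev['reduct']:.0f}")
```

Under that assumed layout, the time ratio of 6.1 and the 400 reductions sit next to a total rate that is several times lower than the GPU rate, which is the gap the question about the allreduce is pointing at. Running the same parse on the VecDot row would give a direct side-by-side comparison.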