On Fri, Jan 21, 2022 at 6:41 PM Mark Adams <mfad...@lbl.gov> wrote:

> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian
> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
> MI200?).
> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI
> are similar (mat-vec is a little faster w/o, the total is about the same,
> call it noise)
>
> I found that MatMult was about 3x faster using 8 cores/GPU, that is all 64
> cores on the node, than when using 1 core/GPU. With the same size problem
> of course.
> I was thinking MatMult should be faster with just one MPI process. Oh
> well, worry about that later.
>
> The bigger problem, and I have observed this to some extent with the
> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are
> expensive or crazy expensive.
> You can see (attached) and the times here that the solve is dominated by
> not-mat-vec:
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)       Flop                              --- Global ---  --- Stage ----  *Total   GPU*    - CpuToGpu -   - GpuToCpu - GPU
>                    Max Ratio  Max       Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R *Mflop/s Mflop/s* Count   Size   Count   Size  %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
> grep "MatMult 400" jac_out_00*5_8_gpuawaremp*
> MatMult              400 1.0 *1.2507e+00* 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27 91100100  0 *668874       0*      0 0.00e+00    0 0.00e+00 100
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
> grep "KSPSolve 2" jac_out_001*_5_8_gpuawaremp*
> KSPSolve               2 1.0 *4.4173e+00* 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 *208923 1094405*      0 0.00e+00    0 0.00e+00 100
>
> Notes about flop counters here:
> * MatMult flops are not logged as GPU flops, but something is logged
>   nonetheless.
> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are
>   at < 1%.

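The arithmetic behind the notes above can be checked with a short Python sketch. It uses only the figures from the quoted log excerpt; the 200 Tflop/s node peak is the guess stated in the message, not a verified hardware number, and the variable names are purely illustrative.

```python
# Figures copied from the -log_view excerpt quoted above.
matmult_time_s = 1.2507e+00    # MatMult, max time over ranks
kspsolve_time_s = 4.4173e+00   # KSPSolve, max time over ranks
ksp_total_mflops = 208923      # KSPSolve "Total Mflop/s" column
ksp_gpu_mflops = 1094405       # KSPSolve "GPU Mflop/s" column

# ASSUMPTION: 200 Tflop/s FP64 node peak is the guess from the message,
# not a confirmed specification. Expressed here in Mflop/s.
assumed_peak_mflops = 200e6

# Share of the solve time spent in the mat-vec versus everything else.
matmult_share = matmult_time_s / kspsolve_time_s
non_matmult_share = 1.0 - matmult_share

# Ratio of the two logged flop rates, and each rate relative to the assumed peak.
gpu_to_total_ratio = ksp_gpu_mflops / ksp_total_mflops
total_fraction_of_peak = ksp_total_mflops / assumed_peak_mflops
gpu_fraction_of_peak = ksp_gpu_mflops / assumed_peak_mflops

print(f"MatMult share of KSPSolve time: {matmult_share:.1%}")
print(f"Non-MatMult share:              {non_matmult_share:.1%}")
print(f"GPU rate / total rate:          {gpu_to_total_ratio:.2f}x")
print(f"Total rate vs assumed peak:     {total_fraction_of_peak:.3%}")
print(f"GPU rate vs assumed peak:       {gpu_fraction_of_peak:.3%}")
```

With these inputs, MatMult accounts for a bit under a third of the KSPSolve time, the GPU rate is a little over 5x the total rate, and both rates sit below 1% of the assumed peak, consistent with the notes.
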
This looks complicated, so just a single remark:

My understanding of the benchmarking of vector ops led by Hannah was that
you needed to be much bigger than 16M to hit peak. I need to get the tech
report, but on 8 GPUs I would think you would be at 10% of peak or
something right off the bat at these sizes. Barry, is that right?

  Thanks,

     Matt

> Anyway, not sure how to proceed but I thought I would share.
> Maybe ask the Kokkos guys if they have looked at Crusher.
>
> Mark

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/