> I suggested years ago that -log_view automatically print useful information
> about the GPU setup (when GPUs are used) but everyone seemed comfortable with
> the lack of information so no one improved it.
FWIW, PetscDeviceView() does a bit of what you want (it just dumps all of
cuda/hipDeviceProp).

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

> On Jan 22, 2022, at 12:55, Barry Smith <bsm...@petsc.dev> wrote:
>
>   I suggested years ago that -log_view automatically print useful information
> about the GPU setup (when GPUs are used) but everyone seemed comfortable with
> the lack of information so no one improved it. I think for a small number of
> GPUs -log_view should just print the details, and for a larger number print
> some statistics (how many physical ones, etc.). Currently, it does not even
> print how many are used. I think requiring another option to get this basic
> information is a mistake; we already print a ton of background with -log_view,
> it is just sad there is no background on the GPU usage.
>
>> On Jan 22, 2022, at 1:06 PM, Jed Brown <j...@jedbrown.org> wrote:
>>
>> Mark Adams <mfad...@lbl.gov> writes:
>>
>>> On Sat, Jan 22, 2022 at 12:29 PM Jed Brown <j...@jedbrown.org> wrote:
>>>
>>>> Mark Adams <mfad...@lbl.gov> writes:
>>>>
>>>>>>> VecPointwiseMult  402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  1  0  0  0 22515 70608  0 0.00e+00  0 0.00e+00 100
>>>>>>> VecScatterBegin   400 1.0 1.6791e-01 6.0 0.00e+00 0.0 3.7e+05 1.6e+04 0.0e+00  0  0 62 54  0   2  0100100  0     0     0  0 0.00e+00  0 0.00e+00   0
>>>>>>> VecScatterEnd     400 1.0 1.0057e+00 7.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0     0     0  0 0.00e+00  0 0.00e+00   0
>>>>>>> PCApply           402 1.0 2.9638e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  1  0  0  0 22490 70608  0 0.00e+00  0 0.00e+00 100
>>>>>>
>>>>>> Most of the MatMult time is attributed to VecScatterEnd here. Can you
>>>>>> share a run of the same total problem size on 8 ranks (one rank per GPU)?
>>>>>
>>>>> Attached. I ran out of memory with the same size problem, so this is the
>>>>> 262K / GPU version.
>>>>
>>>> How was this launched? Is it possible all 8 ranks were using the same GPU?
>>>> (Perf is that bad.)
>>>
>>> srun -n8 -N1 *--ntasks-per-gpu=1* --gpu-bind=closest ../ex13
>>>   -dm_plex_box_faces 2,2,2 -petscpartitioner_simple_process_grid 2,2,2
>>>   -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1
>>>   -dm_refine 6 -dm_view -dm_mat_type aijkokkos -dm_vec_type kokkos
>>>   -pc_type jacobi -log_view -ksp_view -use_gpu_aware_mpi true
>>
>> I'm still worried because the results are so unreasonable. We should add an
>> option like -view_gpu_busid that prints this information per rank.
>>
>> https://code.ornl.gov/olcf/hello_jobstep/-/blob/master/hello_jobstep.cpp
>>
>> A single-process/single-GPU comparison would also be a useful point of
>> comparison.
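[Editor's note: for readers who want to check the rank-to-GPU mapping
themselves, below is a minimal sketch in the spirit of the hello_jobstep.cpp
linked above. It is not part of PETSc or the OLCF example; it assumes an
MPI + CUDA runtime build (on AMD systems the hip* equivalents apply, e.g.
hipDeviceGetPCIBusId), and the file name and output format are illustrative.]

  /* gpu_busid.c -- per-rank GPU visibility sketch (assumed names, not PETSc).
     Each rank prints its hostname, the number of visible devices, and the
     PCI bus ID of the device it currently selects. Identical bus IDs across
     ranks on one node would indicate all ranks sharing a single GPU. */
  #include <mpi.h>
  #include <cuda_runtime.h>   /* on AMD: <hip/hip_runtime.h> and s/cuda/hip/ */
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
    int rank, size, ndev = 0, dev = 0;
    char host[256], busid[64] = "none";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    gethostname(host, sizeof(host));

    if (cudaGetDeviceCount(&ndev) == cudaSuccess && ndev > 0) {
      cudaGetDevice(&dev);                              /* device this rank is bound to */
      cudaDeviceGetPCIBusId(busid, sizeof(busid), dev); /* e.g. "0000:C1:00.0" */
    }

    printf("host %s rank %d/%d sees %d device(s), using device %d busid %s\n",
           host, rank, size, ndev, dev, busid);

    MPI_Finalize();
    return 0;
  }

Launched under the same srun line as in the thread (srun -n8 -N1
--ntasks-per-gpu=1 --gpu-bind=closest ./gpu_busid), eight distinct bus IDs
would rule out the "all ranks on one GPU" explanation for the poor scatter
performance.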