Also, do you guys have an OLCF liaison? That's actually your better bet if you do.
Performance issues with ROCm/Kokkos are pretty common in apps besides just PETSc. We have several teams actively working on rectifying this. However, I think performance issues could be identified more quickly if we had a more "official" and reproducible PETSc GPU benchmark. I've already said this to some folks in this thread, and others have already commented on how difficult such a task is. Hopefully I will have more time soon to illustrate what I am thinking.

On Mon, Jan 24, 2022 at 1:57 PM Justin Chang <jychan...@gmail.com> wrote:

> My name has been called.
>
> Mark, if you're having issues with Crusher, please contact Veronica
> Vergara (vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in
> those emails.
>
> On Mon, Jan 24, 2022 at 1:49 PM Barry Smith <bsm...@petsc.dev> wrote:
>
>>
>> On Jan 24, 2022, at 2:46 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>> Yea, CG/Jacobi is as close to a benchmark code as we could want. I could
>> run this on one processor to get cleaner numbers.
>>
>> Is there a designated ECP technical support contact?
>>
>>
>> Mark, you've forgotten you work for DOE. There isn't a non-ECP
>> technical support contact.
>>
>> But if this is an AMD machine then maybe contact Matt's student Justin
>> Chang?
>>
>>
>> On Mon, Jan 24, 2022 at 2:18 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>>
>>> I think you should contact the crusher ECP technical support team and
>>> tell them you are getting dismal performance and ask if you should expect
>>> better. Don't waste time flogging a dead horse.
>>>
>>> On Jan 24, 2022, at 2:16 PM, Matthew Knepley <knep...@gmail.com> wrote:
>>>
>>> On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang <junchao.zh...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>
>>>>>
>>>>> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang <junchao.zh...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Mark, I think you can benchmark individual vector operations, and
>>>>>> once we get reasonable profiling results, we can move to solvers etc.
>>>>>>
>>>>>
>>>>> Can you suggest a code to run or are you suggesting making a vector
>>>>> benchmark code?
>>>>>
>>>> Make a vector benchmark code, testing vector operations that would be
>>>> used in your solver.
>>>> Also, we can run MatMult() to see if the profiling result is reasonable.
>>>> Only once we get some solid results on basic operations is it useful
>>>> to run big codes.
>>>>
>>>
>>> So we have to make another throw-away code? Why not just look at the
>>> vector ops in Mark's actual code?
>>>
>>> Matt
>>>
>>>>>> --Junchao Zhang
>>>>>>
>>>>>> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>
>>>>>>> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith <bsm...@petsc.dev>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Here except for VecNorm the GPU is used effectively in that most
>>>>>>>> of the time is spent doing real work on the GPU
>>>>>>>>
>>>>>>>> VecNorm 402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02 0 1 0 0 20 9 1 0 0 33 30230 225393 0 0.00e+00 0 0.00e+00 100
>>>>>>>>
>>>>>>>> Even the dots are very effective, only the VecNorm flop rate over
>>>>>>>> the full time is much much lower than the vecdot. Which is somehow due
>>>>>>>> to the use of the GPU or CPU MPI in the allreduce?
>>>>>>>>
>>>>>>>
>>>>>>> The VecNorm GPU rate is relatively high on Crusher and the CPU rate
>>>>>>> is about the same as the other vec ops.
>>>>>>> I don't know what to make of that.
>>>>>>>
>>>>>>> But Crusher is clearly not crushing it.
>>>>>>>
>>>>>>> Junchao: Perhaps we should ask Kokkos if they have any experience
>>>>>>> with Crusher that they can share. They could very well find some low
>>>>>>> level magic.
>>>>>>>
>>>>>>>>
>>>>>>>> On Jan 24, 2022, at 12:14 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>
>>>>>>>>> Mark, can we compare with Spock?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Looks much better. This puts two processes/GPU because there are
>>>>>>>> only 4.
>>>>>>>> <jac_out_001_kokkos_Spock_6_1_notpl.txt>
>>>>>>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
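
For reference, the VecNorm line quoted in the thread can be unpacked with a few lines of Python. This is only a sketch: the column labels below are my own (assumed), following what I understand to be the usual GPU-enabled -log_view event layout (count, time, flop, messages, reductions, global and stage percentages, total and GPU Mflop/s, CPU/GPU transfer counts and sizes, GPU %F). They are not PETSc identifiers. It compares the GPU-only flop rate with the rate over the full event time, which is the gap Barry points at.

```python
# Sketch: unpack one -log_view event line and compare flop rates.
# COLUMNS are illustrative labels (an assumption), not PETSc names.
COLUMNS = [
    "count_max", "count_ratio", "time_max_s", "time_ratio",
    "flop_max", "flop_ratio", "mess", "avg_len", "reduct",
    "global_T", "global_F", "global_M", "global_L", "global_R",
    "stage_T", "stage_F", "stage_M", "stage_L", "stage_R",
    "total_mflops", "gpu_mflops",
    "cpu_to_gpu_count", "cpu_to_gpu_size",
    "gpu_to_cpu_count", "gpu_to_cpu_size", "gpu_pct_F",
]


def parse_event_line(line):
    """Split an event line into its name and a dict of numeric fields."""
    name, *fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(
            f"expected {len(COLUMNS)} numeric fields, got {len(fields)}"
        )
    return name, dict(zip(COLUMNS, map(float, fields)))


# The VecNorm line exactly as quoted in the thread, joined onto one line.
VECNORM = (
    "VecNorm 402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02 "
    "0 1 0 0 20 9 1 0 0 33 30230 225393 0 0.00e+00 0 0.00e+00 100"
)

name, ev = parse_event_line(VECNORM)

# How much faster the GPU-only rate is than the rate over the full event time.
gap = ev["gpu_mflops"] / ev["total_mflops"]

print(f"{name}: GPU rate is {gap:.1f}x the rate over the full event time")
print(f"{name}: max/min time ratio across ranks = {ev['time_ratio']}")
```

Running the same parser over the VecDot lines from the same log would show whether the gap is specific to VecNorm, as the quoted discussion suggests.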