Barry Smith <bsm...@petsc.dev> writes:

> Thanks Mark, far more interesting. I've improved the formatting to make it
> easier to read (and fixed-width font for email reading).
>
> * Can you do the same run with, say, 10 iterations of Jacobi PC?
>
> * PCApply performance (looks like GAMG) is terrible! Problems too small?
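[Editor's note: the comparison run Barry asks for could be launched with something like the sketch below. The executable name, srun layout, and iteration count are placeholders; `-ksp_type cg`, `-pc_type jacobi`, `-vec_type kokkos`, `-mat_type aijkokkos`, and `-log_view` are standard PETSc options.]

```shell
# Hypothetical launch of the Jacobi comparison run (8 nodes x 8 ranks,
# one rank per GCD).  Executable and srun flags are placeholders.
srun -N8 -n64 -c8 --gpus-per-node=8 ./ex56 \
     -ksp_type cg -pc_type jacobi -ksp_max_it 10 \
     -vec_type kokkos -mat_type aijkokkos \
     -log_view
```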
This is -pc_type jacobi.

> * VecScatter time is completely dominated by SFPack! Junchao, what's up with
> that? Lots of little kernels in the PCApply? A PCJACOBI run will help clarify
> where that is coming from.

It's all in MatMult. I'd like to see a run that doesn't wait for the GPU.

> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> MatMult              200 1.0 6.7831e-01 1.0 4.91e+10 1.0 1.1e+04 6.6e+04 1.0e+00  9 92 99 79  0  71 92 100 100   0   579,635 1,014,212   1 2.04e-04   0 0.00e+00 100
> KSPSolve               1 1.0 9.4550e-01 1.0 5.31e+10 1.0 1.1e+04 6.6e+04 6.0e+02 12 100 99 79 94 100 100 100 100 100   449,667   893,741   1 2.04e-04   0 0.00e+00 100
> PCApply              201 1.0 1.6966e-01 1.0 3.09e+08 1.0 0.0e+00 0.0e+00 2.0e+00  2  1  0  0  0  18  1  0  0   0    14,558    16,394   1 0.00e+00   0 0.00e+00 100
> VecTDot              401 1.0 5.3642e-02 1.3 1.23e+09 1.0 0.0e+00 0.0e+00 4.0e+02  1  2  0  0 62   5  2  0  0  66   183,716   353,914   0 0.00e+00   0 0.00e+00 100
> VecNorm              201 1.0 2.2219e-02 1.1 6.17e+08 1.0 0.0e+00 0.0e+00 2.0e+02  0  1  0  0 31   2  1  0  0  33   222,325   303,155   0 0.00e+00   0 0.00e+00 100
> VecAXPY              400 1.0 2.3017e-02 1.1 1.23e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   2  2  0  0   0   427,091   514,744   0 0.00e+00   0 0.00e+00 100
> VecAYPX              199 1.0 1.1312e-02 1.1 6.11e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   1  1  0  0   0   432,323   532,889   0 0.00e+00   0 0.00e+00 100
> VecPointwiseMult     201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   1  1  0  0   0   235,882   290,088   0 0.00e+00   0 0.00e+00 100
> VecScatterBegin      200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04 1.0e+00  2  0 99 79  0  19  0 100 100   0         0         0   1 2.04e-04   0 0.00e+00   0
> VecScatterEnd        200 1.0 1.9007e-02 3.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0   0         0         0   0 0.00e+00   0 0.00e+00   0
> SFPack               200 1.0 1.7309e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  18  0  0  0   0         0         0   1 2.04e-04   0 0.00e+00   0
> SFUnpack             200 1.0 2.3165e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0   0         0         0   0 0.00e+00   0 0.00e+00   0

>> On Jan 25, 2022, at 8:29 AM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>> adding Suyash,
>>
>> I found the/a problem. Using ex56, which has a crappy decomposition, using
>> one MPI process/GPU is much faster than using 8 (64 total). (I am looking at
>> ex13 to see how much of this is due to the decomposition.)
>> If you only use 8 processes it seems that all 8 are put on the first GPU,
>> but adding -c8 seems to fix this.
>> Now the numbers are looking reasonable.
>>
>> On Mon, Jan 24, 2022 at 3:24 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>> For this, to start, someone can run
>>
>>   src/vec/vec/tutorials/performance.c
>>
>> and compare the performance to that in the technical report "Evaluation of
>> PETSc on a Heterogeneous Architecture: the OLCF Summit System, Part I:
>> Vector Node Performance" (Google to find it). One does not have to, and
>> shouldn't, do an extensive study right now that compares everything; instead
>> one should run a very small number of different problem sizes (make them big)
>> and compare those sizes with what Summit gives. Note you will need to make
>> sure that performance.c uses the Kokkos backend.
>>
>> One hopes for better performance than Summit; if it is far worse, we
>> know something is very wrong somewhere. I'd love to see some comparisons.
>>
>>   Barry
>>
>>> On Jan 24, 2022, at 3:06 PM, Justin Chang <jychan...@gmail.com> wrote:
>>>
>>> Also, do you guys have an OLCF liaison? That's actually your better bet if
>>> you do.
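[Editor's note: the performance.c comparison Barry suggests could be launched roughly as sketched below. The directory layout follows the PETSc source tree and `-vec_type kokkos` is the standard way to select the Kokkos backend, but the make target invocation and srun flags are assumptions, and problem sizes need to be chosen large enough to saturate the GPU.]

```shell
# Build and run the vector-performance tutorial with the Kokkos backend.
# Assumes PETSC_DIR points at a Kokkos-enabled PETSc build.
cd "$PETSC_DIR/src/vec/vec/tutorials"
make performance
srun -n1 --gpus-per-node=1 ./performance -vec_type kokkos -log_view
```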
>>>
>>> Performance issues with ROCm/Kokkos are pretty common in apps besides just
>>> PETSc. We have several teams actively working on rectifying this. However,
>>> I think performance issues could be quicker to identify if we had a more
>>> "official" and reproducible PETSc GPU benchmark, which I've already
>>> expressed to some folks in this thread, and as others have already
>>> commented, that is a difficult task. Hopefully I will have more time soon
>>> to illustrate what I am thinking.
>>>
>>> On Mon, Jan 24, 2022 at 1:57 PM Justin Chang <jychan...@gmail.com> wrote:
>>> My name has been called.
>>>
>>> Mark, if you're having issues with Crusher, please contact Veronica Vergara
>>> (vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in those emails.
>>>
>>> On Mon, Jan 24, 2022 at 1:49 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>>> On Jan 24, 2022, at 2:46 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>
>>>> Yeah, CG/Jacobi is as close to a benchmark code as we could want. I could
>>>> run this on one processor to get cleaner numbers.
>>>>
>>>> Is there a designated ECP technical support contact?
>>>
>>> Mark, you've forgotten you work for DOE. There isn't a non-ECP technical
>>> support contact.
>>>
>>> But if this is an AMD machine, then maybe contact Matt's student Justin
>>> Chang?
>>>
>>>> On Mon, Jan 24, 2022 at 2:18 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>
>>>> I think you should contact the Crusher ECP technical support team, tell
>>>> them you are getting dismal performance, and ask if you should expect
>>>> better. Don't waste time flogging a dead horse.
>>>>
>>>>> On Jan 24, 2022, at 2:16 PM, Matthew Knepley <knep...@gmail.com> wrote:
>>>>>
>>>>> On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>
>>>>> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>
>>>>> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>> Mark, I think you can benchmark individual vector operations, and once we
>>>>> get reasonable profiling results, we can move to solvers etc.
>>>>>
>>>>> Can you suggest a code to run, or are you suggesting making a vector
>>>>> benchmark code?
>>>>> Make a vector benchmark code, testing the vector operations that would be
>>>>> used in your solver.
>>>>> Also, we can run MatMult() to see if the profiling result is reasonable.
>>>>> Only once we get some solid results on basic operations is it useful to
>>>>> run big codes.
>>>>>
>>>>> So we have to make another throw-away code? Why not just look at the
>>>>> vector ops in Mark's actual code?
>>>>>
>>>>>    Matt
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>
>>>>> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>
>>>>> Here, except for VecNorm, the GPU is used effectively in that most of the
>>>>> time is spent doing real work on the GPU:
>>>>>
>>>>> VecNorm   402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02  0  1  0  0 20   9  1  0  0 33   30230  225393   0 0.00e+00   0 0.00e+00 100
>>>>>
>>>>> Even the dots are very effective; only the VecNorm flop rate over the
>>>>> full time is much, much lower than the VecTDot rate. Is that somehow due
>>>>> to the use of the GPU or CPU MPI in the allreduce?
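[Editor's note: one way to sanity-check flop rates like the ones discussed here is to convert them into effective memory bandwidth, since these vector kernels are bandwidth-bound. Taking the VecAXPY GPU rate from the log at the top of the thread (427,091 Mflop/s, aggregate over 8 ranks): VecAXPY (y = a*x + y) performs 2 flops per entry while streaming 3 doubles (read x, read y, write y), i.e. 24 bytes per 2 flops = 12 bytes/flop. A quick sketch of the arithmetic:]

```shell
# Back-of-envelope bandwidth check for VecAXPY from the -log_view table.
# 2 flops per entry, 24 bytes moved per entry => 12 bytes per flop.
mflops=427091   # VecAXPY GPU Mflop/s (aggregate over 8 ranks)
bw_gbs=$(awk -v m="$mflops" 'BEGIN { printf "%.0f", m * 1e6 * 12 / 1e9 }')
echo "aggregate ~${bw_gbs} GB/s; per GCD (8 ranks): ~$((bw_gbs / 8)) GB/s"
```

That works out to roughly 5.1 TB/s aggregate, or about 640 GB/s per GCD, a concrete number to compare against the hardware's peak HBM bandwidth.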
>>>>>
>>>>> The VecNorm GPU rate is relatively high on Crusher, and the CPU rate is
>>>>> about the same as the other vec ops. I don't know what to make of that.
>>>>>
>>>>> But Crusher is clearly not crushing it.
>>>>>
>>>>> Junchao: Perhaps we should ask the Kokkos folks if they have any
>>>>> experience with Crusher that they can share. They could very well find
>>>>> some low-level magic.
>>>>>
>>>>>> On Jan 24, 2022, at 12:14 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>
>>>>>> Mark, can we compare with Spock?
>>>>>>
>>>>>> Looks much better. This puts two processes/GPU because there are only 4.
>>>>>> <jac_out_001_kokkos_Spock_6_1_notpl.txt>
>>>>>
>>>>> --
>>>>> What most experimenters take for granted before they begin their
>>>>> experiments is infinitely more interesting than any results to which
>>>>> their experiments lead.
>>>>> -- Norbert Wiener
>>>>>
>>>>> https://www.cse.buffalo.edu/~knepley/
>>
>> <jac_out_001_kokkos_Crusher_159_1.txt>
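[Editor's note: on the rank-to-GPU mapping Mark mentions earlier in the thread (all 8 ranks landing on the first GPU until `-c8` was added): on a Slurm machine this is controlled by the cores-per-task and GPU-binding flags. A hedged sketch follows; `-c`, `--gpus-per-node`, and `--gpu-bind` are standard Slurm options, but the node/rank counts, executable, and PETSc options are placeholders.]

```shell
# One rank per GCD on a single node: reserve 8 cores per rank so ranks
# spread across the node, and bind each rank to its closest GCD.
# Executable name and PETSc options are placeholders.
srun -N1 -n8 -c8 --gpus-per-node=8 --gpu-bind=closest \
     ./ex13 -vec_type kokkos -mat_type aijkokkos -log_view
```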