Barry Smith <bsm...@petsc.dev> writes:

>   Thanks Mark, far more interesting. I've improved the formatting to make it 
> easier to read (use a fixed-width font when reading the email).
>
>   * Can you do the same run with, say, 10 iterations of the Jacobi PC?
>
>   * PCApply performance (looks like GAMG) is terrible! Problems too small?

This is -pc_type jacobi.

>   * VecScatter time is completely dominated by SFPack! Junchao, what's up with 
> that? Lots of little kernels in the PCApply? A PCJACOBI run will help clarify 
> where that is coming from.

It's all in MatMult.

I'd like to see a run that doesn't wait for the GPU.

> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> MatMult              200 1.0 6.7831e-01 1.0 4.91e+10 1.0 1.1e+04 6.6e+04 1.0e+00  9 92 99 79  0  71 92100100  0 579,635  1,014,212      1 2.04e-04    0 0.00e+00 100
> KSPSolve               1 1.0 9.4550e-01 1.0 5.31e+10 1.0 1.1e+04 6.6e+04 6.0e+02 12100 99 79 94 100100100100100 449,667    893,741      1 2.04e-04    0 0.00e+00 100
> PCApply              201 1.0 1.6966e-01 1.0 3.09e+08 1.0 0.0e+00 0.0e+00 2.0e+00  2  1  0  0  0  18  1  0  0  0  14,558    163,941      0 0.00e+00    0 0.00e+00 100
> VecTDot              401 1.0 5.3642e-02 1.3 1.23e+09 1.0 0.0e+00 0.0e+00 4.0e+02  1  2  0  0 62   5  2  0  0 66 183,716    353,914      0 0.00e+00    0 0.00e+00 100
> VecNorm              201 1.0 2.2219e-02 1.1 6.17e+08 1.0 0.0e+00 0.0e+00 2.0e+02  0  1  0  0 31   2  1  0  0 33 222,325    303,155      0 0.00e+00    0 0.00e+00 100
> VecAXPY              400 1.0 2.3017e-02 1.1 1.23e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   2  2  0  0  0 427,091    514,744      0 0.00e+00    0 0.00e+00 100
> VecAYPX              199 1.0 1.1312e-02 1.1 6.11e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   1  1  0  0  0 432,323    532,889      0 0.00e+00    0 0.00e+00 100
> VecPointwiseMult     201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   1  1  0  0  0 235,882    290,088      0 0.00e+00    0 0.00e+00 100
> VecScatterBegin      200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04 1.0e+00  2  0 99 79  0  19  0100100  0       0          0      1 2.04e-04    0 0.00e+00  0
> VecScatterEnd        200 1.0 1.9007e-02 3.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0       0          0      0 0.00e+00    0 0.00e+00  0
> SFPack               200 1.0 1.7309e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  18  0  0  0  0       0          0      1 2.04e-04    0 0.00e+00  0
> SFUnpack             200 1.0 2.3165e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0       0          0      0 0.00e+00    0 0.00e+00  0
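Reading just the MatMult line above (and taking the 8 ranks of this run, which is what the 579,635 MF/s Total column is consistent with):

  rate over wall time   : 8 * 4.91e+10 flop / 6.78e-01 s ~ 5.8e+11 flop/s   (the 579,635 MF/s column)
  rate over kernel time : 1,014,212 MF/s ~ 1.0e+12 flop/s
  implied kernel time   : 8 * 4.91e+10 flop / 1.0e+12 flop/s ~ 0.39 s of the 0.68 s total

so roughly 0.29 s of MatMult is spent outside the kernels, which lines up with VecScatterBegin (0.185 s) being almost entirely SFPack (0.173 s).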
>
>
>> On Jan 25, 2022, at 8:29 AM, Mark Adams <mfad...@lbl.gov> wrote:
>> 
>> adding Suyash,
>> 
>> I found the (or at least a) problem. Using ex56, which has a crappy decomposition, 
>> one MPI process per GPU is much faster than using 8 per GPU (64 total). (I am looking at 
>> ex13 to see how much of this is due to the decomposition.)
>> If you use only 8 processes it seems that all 8 are put on the first GPU, 
>> but adding -c8 seems to fix this.
>> Now the numbers are looking reasonable.
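For reference, the kind of launch being discussed would look roughly like the following (a sketch only; the Slurm flags, executable, and solver options here are illustrative assumptions, not the exact command used):

  srun -n8 -c8 --gpus-per-node=8 --gpu-bind=closest ./ex56 -ksp_type cg -pc_type jacobi -mat_type aijkokkos -vec_type kokkos -log_view

presumably because -c8 (--cpus-per-task=8) spreads the 8 ranks across the node's cores, so the closest-GPU binding gives each rank its own GCD instead of stacking all eight ranks on the first GPU.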
>> 
>> On Mon, Jan 24, 2022 at 3:24 PM Barry Smith <bsm...@petsc.dev> wrote:
>> 
>>   For this, to start, someone can run 
>> 
>> src/vec/vec/tutorials/performance.c 
>> 
>> and compare the performance to that in the technical report "Evaluation of 
>> PETSc on a Heterogeneous Architecture: the OLCF Summit System, Part I: 
>> Vector Node Performance" (Google will find it). One does not have to, and shouldn't, 
>> do an extensive study right now that compares everything; instead run a 
>> very small number of problem sizes (make them big) and compare those with 
>> what Summit gives. Note you will need to make sure that performance.c uses 
>> the Kokkos backend.
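A minimal way to do that (a sketch with assumed details: the usual PETSc tutorial makefile target, a single rank, and whatever size options the example accepts, which -help will list):

  cd src/vec/vec/tutorials && make performance
  srun -n1 ./performance -vec_type kokkos -log_view

where -vec_type kokkos is what selects the Kokkos backend at runtime.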
>> 
>>   One hopes for better performance than Summit; if one gets tons worse we 
>> know something is very wrong somewhere. I'd love to see some comparisons.
>> 
>>   Barry
>> 
>> 
>>> On Jan 24, 2022, at 3:06 PM, Justin Chang <jychan...@gmail.com> wrote:
>>> 
>>> Also, do you guys have an OLCF liaison? That's actually your better bet if 
>>> you do. 
>>> 
>>> Performance issues with ROCm/Kokkos are pretty common in apps besides just 
>>> PETSc. We have several teams actively working on rectifying this. However, 
>>> I think performance issues could be identified more quickly if we had a more 
>>> "official" and reproducible PETSc GPU benchmark, something I've already 
>>> raised with some folks in this thread, though, as others have commented, 
>>> such a task is difficult. Hopefully I will have more time soon to 
>>> illustrate what I am thinking.
>>> 
>>> On Mon, Jan 24, 2022 at 1:57 PM Justin Chang <jychan...@gmail.com> wrote:
>>> My name has been called.
>>> 
>>> Mark, if you're having issues with Crusher, please contact Veronica Vergara 
>>> (vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in those emails.
>>> 
>>> On Mon, Jan 24, 2022 at 1:49 PM Barry Smith <bsm...@petsc.dev> wrote:
>>> 
>>> 
>>>> On Jan 24, 2022, at 2:46 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>> 
>>>> Yea, CG/Jacobi is as close to a benchmark code as we could want. I could 
>>>> run this on one processor to get cleaner numbers.
>>>> 
>>>> Is there a designated ECP technical support contact?
>>> 
>>>    Mark, you've forgotten you work for DOE. There isn't a non-ECP technical 
>>> support contact. 
>>> 
>>>    But if this is an AMD machine then maybe contact Matt's student Justin 
>>> Chang?
>>> 
>>> 
>>> 
>>>> 
>>>> 
>>>> On Mon, Jan 24, 2022 at 2:18 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>> 
>>>>   I think you should contact the Crusher ECP technical support team, tell 
>>>> them you are getting dismal performance, and ask if you should expect 
>>>> better. Don't waste time flogging a dead horse. 
>>>> 
>>>>> On Jan 24, 2022, at 2:16 PM, Matthew Knepley <knep...@gmail.com> wrote:
>>>>> 
>>>>> On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>> 
>>>>> 
>>>>> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>> 
>>>>> 
>>>>> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>> Mark, I think you can benchmark individual vector operations, and once we 
>>>>> get reasonable profiling results, we can move to solvers etc.
>>>>> 
>>>>> Can you suggest a code to run or are you suggesting making a vector 
>>>>> benchmark code?
>>>>> Make a vector benchmark code, testing vector operations that would be 
>>>>> used in your solver.
>>>>> Also, we can run MatMult() to see if the profiling result is reasonable.
>>>>> Only once we get some solid results on basic operations will it be useful to 
>>>>> run big codes.
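For concreteness, a vector benchmark of the kind suggested here can be as small as the following sketch (this is not PETSc's performance.c; the default size, iteration count, and choice of operations are arbitrary, and it assumes a recent PETSc with PetscCall()):

#include <petscvec.h>
#include <petsctime.h>

int main(int argc, char **argv)
{
  Vec            x, y;
  PetscInt       i, n = 10000000, nits = 100;
  PetscScalar    dot;
  PetscReal      nrm;
  PetscLogDouble t0, t1;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL));
  PetscCall(VecCreate(PETSC_COMM_WORLD, &x));
  PetscCall(VecSetSizes(x, PETSC_DECIDE, n));
  PetscCall(VecSetFromOptions(x));         /* honors -vec_type kokkos (or cuda, hip, standard) */
  PetscCall(VecDuplicate(x, &y));
  PetscCall(VecSet(x, 1.0));
  PetscCall(VecSet(y, 2.0));
  PetscCall(VecAXPY(y, 1.0, x));           /* warm up so first-launch costs are not timed */
  PetscCallMPI(MPI_Barrier(PETSC_COMM_WORLD));
  PetscCall(PetscTime(&t0));
  for (i = 0; i < nits; i++) {
    PetscCall(VecAXPY(y, 1.0, x));         /* streaming update, no reduction */
    PetscCall(VecDot(x, y, &dot));         /* reduction; returning the scalar syncs the device */
    PetscCall(VecNorm(y, NORM_2, &nrm));   /* reduction; the op that looks slow in the logs */
  }
  PetscCall(PetscTime(&t1));
  PetscCall(PetscPrintf(PETSC_COMM_WORLD, "n = %" PetscInt_FMT ", avg time per iteration = %g s\n", n, (double)((t1 - t0) / nits)));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&y));
  PetscCall(PetscFinalize());
  return 0;
}

Running it with -vec_type kokkos -log_view produces the same kind of per-event table as above, so the numbers can be compared directly against the solver run.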
>>>>> 
>>>>> So we have to make another throw-away code? Why not just look at the 
>>>>> vector ops in Mark's actual code?
>>>>> 
>>>>>    Matt
>>>>>  
>>>>>  
>>>>> 
>>>>> --Junchao Zhang
>>>>> 
>>>>> 
>>>>> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>> 
>>>>> 
>>>>> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>> 
>>>>>   Here, except for VecNorm, the GPU is used effectively in that most of the 
>>>>> time is spent doing real work on the GPU.
>>>>> 
>>>>> VecNorm              402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393      0 0.00e+00    0 0.00e+00 100
>>>>> 
>>>>> Even the dots are very effective; only the VecNorm flop rate over the 
>>>>> full time is much, much lower than VecDot's. Is that somehow due to the 
>>>>> use of GPU or CPU MPI in the allreduce?
>>>>> 
>>>>> The VecNorm GPU rate is relatively high on Crusher and the CPU rate is 
>>>>> about the same as the other vec ops. I don't know what to make of that.
>>>>> 
>>>>> But Crusher is clearly not crushing it. 
>>>>> 
>>>>> Junchao: Perhaps we should ask Kokkos if they have any experience with 
>>>>> Crusher that they can share. They could very well find some low level 
>>>>> magic.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Jan 24, 2022, at 12:14 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Mark, can we compare with Spock?
>>>>>> 
>>>>>>  Looks much better. This puts two processes per GPU because there are only 4 GPUs.
>>>>>> <jac_out_001_kokkos_Spock_6_1_notpl.txt>
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> What most experimenters take for granted before they begin their 
>>>>> experiments is infinitely more interesting than any results to which 
>>>>> their experiments lead.
>>>>> -- Norbert Wiener
>>>>> 
>>>>> https://www.cse.buffalo.edu/~knepley/
>>>> 
>>> 
>> 
>> <jac_out_001_kokkos_Crusher_159_1.txt>
