OK, I wrote to the OLCF Consultants and they told me that

* Yes, the jsrun Visualizer numberings correspond to the 'lstopo' ones.

and from this I can conclude that

* If I ask for 6 resource sets, each with 1 core and 1 GPU, then some of the 
cores in different resource sets will share L2/L3 cache.

* For the above case, in which I want 6 MPI ranks that don't share anything, I 
need to ask for 6 resource sets, each with *2 cores* and 1 GPU. When I ask for 
2 cores, each resource set consists of the 2 cores that share an L2/L3, so this 
is how you can get resource sets that don't share L2/L3 with each other.
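
For concreteness, here is my guess at what such a jsrun invocation would look 
like (untested; the flags are the same ones Junchao uses further down in this 
thread, and ./my_app is just a placeholder for the real executable):

jsrun -n 6 -a 1 -c 2 -g 1 -r 6 --launch_distribution packed --bind packed:1 ./my_app

That is, one rank per resource set, with each resource set owning both cores of 
an L2/L3-sharing pair even though each rank is bound to only one of them.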

--Richard

On 9/23/19 11:10 AM, Mills, Richard Tran wrote:
To further muddy the waters, the OLCF Summit User Guide 
(https://www.olcf.ornl.gov/for-users/system-user-guides/summit/summit-user-guide)
 states that

"The POWER9 processor is built around IBM’s SIMD Multi-Core (SMC). The 
processor provides 22 SMCs with separate 32kB L1 data and instruction caches. 
Pairs of SMCs share a 512kB L2 cache and a 10MB L3 cache."

And there is some funny stuff in that lstopo output. On the first socket, I see 
one "SMC" that doesn't share L2/L3 with anyone. This may be because it actually 
shares them with a "service" core that is hidden from jsrun. But why are there 
three such SMCs on the second socket?!

I've written to the OLCF Consultants to see if they can provide any 
clarification on this. In particular, I want to know if the jsrun Visualizer 
hardware thread and core numberings correspond to the lstopo ones. I think 
that's the only way to tell if we are getting cores that don't share L2/L3 
resources or not.
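
(One thing that may help independent of the Visualizer: prepend js_task_info to 
the executable, as Junchao does in his commands further down, and compare the 
CPU list it reports for each rank against the lstopo output. Something along 
these lines, where the resource-set options are purely illustrative:

jsrun -n 6 -a 1 -c 2 -g 1 -r 6 --bind packed:1 js_task_info ./ex900 -f HV15R.aij -n 100

My understanding is that js_task_info prints the CPUs and GPUs actually 
assigned to each task, so cache-sharing placements should be visible directly, 
assuming its CPU numbering matches lstopo's.)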

--Richard


On 9/23/19 10:58 AM, Zhang, Junchao wrote:
The figure did not clearly say all cores share L3.  Instead, we should look at 
p.16 of https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf

"The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, 
and an interconnection system that connects all components within the chip at 7 
TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 
embedded DRAM (eDRAM)."
--Junchao Zhang


On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev 
<petsc-dev@mcs.anl.gov> wrote:
L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF 
output from a Summit compute node to see an illustration of the node layout.
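
(In case anyone wants to regenerate that figure: I believe hwloc's lstopo can 
write it directly when run on a compute node, with the output format taken from 
the file extension, e.g.

lstopo summit-node.pdf

where the filename is just an example.)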

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:
I also did an OpenMP stream test and found a mismatch between the OpenMP and 
MPI results. That reminded me of a subtle issue on Summit: pairs of cores share 
an L2 cache. One has to place MPI ranks on different pairs to get the best 
bandwidth. See the different bindings at 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
socket has 21 cores available, which I assume means 11 pairs. The new results 
are below (a sketch of the corresponding jsrun commands follows the table). 
They match what I got with OpenMP. The bandwidth almost doubles going from 1 to 
2 cores per socket. The IBM document also says each socket has two memory 
controllers. I could not find the core-to-memory-controller affinity info. I 
tried different bindings but did not see a huge difference.

#Ranks  Rate (MB/s)    Ratio over 2 ranks
1         29229.8       -
2         59091.0      1.0
4        112260.7      1.9
6        159852.8      2.7
8        194351.7      3.3
10       215841.0      3.7
12       232316.6      3.9
14       244615.7      4.1
16       254450.8      4.3
18       262185.7      4.4
20       267181.0      4.5
22       270290.4      4.6
24       221944.9      3.8
26       238302.8      4.0


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:

  Junchao,

     For completeness, could you please run with a single core? But leave the 
ratio normalized to 2 ranks, as you have it, since that is the correct model.

   Thanks

     Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>
> I did a stream test on Summit. I used the MPI version from PETSc, but greatly 
> increased the array size N since one socket of Summit has 120MB of L3 cache. I 
> used the MPI version since it was easy for me to distribute ranks evenly 
> across the two sockets.
> The result matches the data released by OLCF (see attached figure) and the 
> data given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks     Rate (MB/s)     Ratio over 2 ranks
> ------------------------------------------
> 2          59012.2834        1.00
> 4          70959.1475        1.20
> 6         106639.9837        1.81
> 8         138638.6929        2.35
> 10        171125.0873        2.90
> 12        196162.5197        3.32
> 14        215272.7810        3.65
> 16        229562.4040        3.89
> 18        242587.4913        4.11
> 20        251057.1731        4.25
> 22        258569.7794        4.38
> 24        265443.2924        4.50
> 26        266562.7872        4.52
> 28        267043.6367        4.53
> 30        266833.7212        4.52
> 32        267183.8474        4.53
>
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>
>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine.
>
>   There are a few differences (compiler, the reported results are with 
> OpenMP, different number of cores), but yes, the performance is a bit low. For 
> DOE that is great; it makes GPUs look better :-)
>
>
> > On Sep 21, 2019, at 11:11 PM, Jed Brown <j...@jedbrown.org> wrote:
> >
> > For an AIJ matrix with 32-bit integers, this is 1 flop per 6 bytes (2 flops 
> > per stored nonzero against 8 bytes of value plus 4 bytes of column index), 
> > or 165 GB/s for the node in the best case (42 ranks at 27493 Mflop/s).
> >
> > My understanding is that these systems have 8 channels of DDR4-2666 per
> > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> > system, and 270 GB/s STREAM Triad according to this post
> >
> >  
> > https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> >
> > Is this 60% of Triad the best we can get for SpMV?
> >
> > "Zhang, Junchao via petsc-dev" <petsc-dev@mcs.anl.gov> writes:
> >
> >> 42 cores have better performance.
> >>
> >> 36 MPI ranks
> >> MatMult              100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
> >> 0.0e+00  6 99 97 28  0 100100100100  0 25145       0      0 0.00e+00    0 
> >> 0.00e+00  0
> >> VecScatterBegin      100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
> >> 0.0e+00  0  0 97 28  0   1  0100100  0     0       0      0 0.00e+00    0 
> >> 0.00e+00  0
> >> VecScatterEnd        100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  1  0  0  0  0  22  0  0  0  0     0       0      0 0.00e+00    0 
> >> 0.00e+00  0
> >>
> >> --Junchao Zhang
> >>
> >>
> >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> >>
> >>  Junchao,
> >>
> >>    Mark has a good point; could you also try for completeness the CPU with 
> >> 36 cores and see if it is any better than the 42 core case?
> >>
> >>  Barry
> >>
> >>  So, extrapolating, about 20 nodes of the CPUs are equivalent to 1 node of 
> >> the GPUs for the multiply at this problem size.
> >>
> >>> On Sep 21, 2019, at 6:40 PM, Mark Adams <mfad...@lbl.gov> wrote:
> >>>
> >>> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty 
> >>> saturated at that point.
> >>>
> >>> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
> >>> Here are CPU version results on one node with 24 cores, 42 cores. Click 
> >>> the links for core layout.
> >>>
> >>> 24 MPI ranks, 
> >>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> >>> MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
> >>> 0.0e+00  8 99 97 25  0 100100100100  0 17948       0      0 0.00e+00    0 
> >>> 0.00e+00  0
> >>> VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
> >>> 0.0e+00  0  0 97 25  0   0  0100100  0     0       0      0 0.00e+00    0 
> >>> 0.00e+00  0
> >>> VecScatterEnd        100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> >>> 0.0e+00  2  0  0  0  0  19  0  0  0  0     0       0      0 0.00e+00    0 
> >>> 0.00e+00  0
> >>>
> >>> 42 MPI ranks, 
> >>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
> >>> MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
> >>> 0.0e+00 23 99 97 30  0 100100100100  0 27493       0      0 0.00e+00    0 
> >>> 0.00e+00  0
> >>> VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
> >>> 0.0e+00  0  0 97 30  0   1  0100100  0     0       0      0 0.00e+00    0 
> >>> 0.00e+00  0
> >>> VecScatterEnd        100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> >>> 0.0e+00  6  0  0  0  0  24  0  0  0  0     0       0      0 0.00e+00    0 
> >>> 0.00e+00  0
> >>>
> >>> --Junchao Zhang
> >>>
> >>>
> >>> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> >>>
> >>>  Junchao,
> >>>
> >>>   Very interesting. For completeness please also run with 24 and 42 CPU 
> >>> ranks without the GPUs. Note that the default layout for CPU cores is not 
> >>> good. You will want 3 cores on each socket, then 12 on each.
> >>>
> >>>  Thanks
> >>>
> >>>   Barry
> >>>
> >>>  Since Tim is one of our reviewers next week this is a very good test 
> >>> matrix :-)
> >>>
> >>>
> >>>> On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
> >>>>
> >>>> Click the links to visualize it.
> >>>>
> >>>> 6 ranks
> >>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
> >>>> jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU 
> >>>> --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
> >>>> HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >>>>
> >>>> 24 ranks
> >>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> >>>> jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU 
> >>>> --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
> >>>> HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >>>>
> >>>> --Junchao Zhang
> >>>>
> >>>>
> >>>> On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
> >>>> Junchao,
> >>>>
> >>>> Can you share your 'jsrun' command so that we can see how you are 
> >>>> mapping things to resource sets?
> >>>>
> >>>> --Richard
> >>>>
> >>>> On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
> >>>>> I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix 
> >>>>> Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 
> >>>>> times on one node of Summit with -mat_type aijcusparse -vec_type cuda. 
> >>>>> I found MatMult was almost dominated by VecScatter in this simple test. 
> >>>>> Using 6 MPI ranks + 6 GPUs, I found CUDA-aware SF could improve 
> >>>>> performance. But if I enabled the Multi-Process Service on Summit and 
> >>>>> used 24 ranks + 6 GPUs, I found CUDA-aware SF hurt performance. I don't 
> >>>>> know why and will have to profile it. I will also collect data with 
> >>>>> multiple nodes. Are the matrix and tests appropriate?
> >>>>>
> >>>>> ------------------------------------------------------------------------------------------------------------------------
> >>>>> Event                Count      Time (sec)     Flop                     
> >>>>>          --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   
> >>>>> - GpuToCpu - GPU
> >>>>>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
> >>>>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   
> >>>>> Count   Size  %F
> >>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>>>> 6 MPI ranks (CPU version)
> >>>>> MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 
> >>>>> 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 
> >>>>> 0.00e+00    0 0.00e+00  0
> >>>>> VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 
> >>>>> 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 
> >>>>> 0.00e+00    0 0.00e+00  0
> >>>>> VecScatterEnd        100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00 
> >>>>> 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 
> >>>>> 0.00e+00    0 0.00e+00  0
> >>>>>
> >>>>> 6 MPI ranks + 6 GPUs + regular SF
> >>>>> MatMult              100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 
> >>>>> 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 
> >>>>> 1.02e+02  100 2.69e+02 100
> >>>>> VecScatterBegin      100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 
> >>>>> 2.2e+05 0.0e+00  0  0 97 18  0  64  0100100  0     0       0      0 
> >>>>> 0.00e+00  100 2.69e+02  0
> >>>>> VecScatterEnd        100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 
> >>>>> 0.0e+00 0.0e+00  0  0  0  0  0  22  0  0  0  0     0       0      0 
> >>>>> 0.00e+00    0 0.00e+00  0
> >>>>> VecCUDACopyTo        100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 
> >>>>> 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0     0       0    100 
> >>>>> 1.02e+02    0 0.00e+00  0
> >>>>> VecCopyFromSome      100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 
> >>>>> 0.0e+00 0.0e+00  0  0  0  0  0  54  0  0  0  0     0       0      0 
> >>>>> 0.00e+00  100 2.69e+02  0
> >>>>>
> >>>>> 6 MPI ranks + 6 GPUs + CUDA-aware SF
> >>>>> MatMult              100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 
> >>>>> 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 
> >>>>> 0.00e+00    0 0.00e+00 100
> >>>>> VecScatterBegin      100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 
> >>>>> 2.2e+05 0.0e+00  1  0 97 18  0  70  0100100  0     0       0      0 
> >>>>> 0.00e+00    0 0.00e+00  0
> >>>>> VecScatterEnd        100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 
> >>>>> 0.0e+00 0.0e+00  0  0  0  0  0  17  0  0  0  0     0       0      0 
> >>>>> 0.00e+00    0 0.00e+00  0
> >>>>>
> >>>>> 24 MPI ranks + 6 GPUs + regular SF
> >>>>> MatMult              100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 
> >>>>> 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 
> >>>>> 4.61e+01  100 6.72e+01 100
> >>>>> VecScatterBegin      100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 
> >>>>> 5.9e+04 0.0e+00  0  0 97 25  0  34  0100100  0     0       0      0 
> >>>>> 0.00e+00  100 6.72e+01  0
> >>>>> VecScatterEnd        100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 
> >>>>> 0.0e+00 0.0e+00  1  0  0  0  0  42  0  0  0  0     0       0      0 
> >>>>> 0.00e+00    0 0.00e+00  0
> >>>>> VecCUDACopyTo        100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 
> >>>>> 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0    100 
> >>>>> 4.61e+01    0 0.00e+00  0
> >>>>> VecCopyFromSome      100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 
> >>>>> 0.0e+00 0.0e+00  0  0  0  0  0  29  0  0  0  0     0       0      0 
> >>>>> 0.00e+00  100 6.72e+01  0
> >>>>>
> >>>>> 24 MPI ranks + 6 GPUs + CUDA-aware SF
> >>>>> MatMult              100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 
> >>>>> 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 387864   973391    0 
> >>>>> 0.00e+00    0 0.00e+00 100
> >>>>> VecScatterBegin      100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 
> >>>>> 5.9e+04 0.0e+00  1  0 97 25  0  35  0100100  0     0       0      0 
> >>>>> 0.00e+00    0 0.00e+00  0
> >>>>> VecScatterEnd        100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 
> >>>>> 0.0e+00 0.0e+00  1  0  0  0  0  48  0  0  0  0     0       0      0 
> >>>>> 0.00e+00    0 0.00e+00  0
> >>>>>
> >>>>>
> >>>>> --Junchao Zhang
> >>>>
> >>>
>
> <SummitNode.png>



