The figure did not clearly say that all cores share the L3. Instead, we should look at p. 16 of https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf:

"The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, and an interconnection system that connects all components within the chip at 7 TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 embedded DRAM (eDRAM)."

--Junchao Zhang

On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev <petsc-dev@mcs.anl.gov> wrote:

L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF output from a Summit compute node for an illustration of the node layout.

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:

I also did an OpenMP stream test and found a mismatch between the OpenMP and MPI results. That reminded me of a subtle issue on Summit: each pair of cores shares an L2 cache, so one has to place MPI ranks on different pairs to get the best bandwidth. See the two different bindings at https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note that each socket has 21 usable cores; I assume that means 11 pairs. The new results are below. They match what I got with OpenMP. The bandwidth almost doubles from 1 to 2 cores per socket. The IBM document also says each socket has two memory controllers, but I could not find the core-to-memory-controller affinity information. I tried different bindings and did not find a huge difference.

#Ranks  Rate (MB/s)  Ratio over 2 ranks
------------------------------------------
  1      29229.8       -
  2      59091.0      1.0
  4     112260.7      1.9
  6     159852.8      2.7
  8     194351.7      3.3
 10     215841.0      3.7
 12     232316.6      3.9
 14     244615.7      4.1
 16     254450.8      4.3
 18     262185.7      4.4
 20     267181.0      4.5
 22     270290.4      4.6
 24     221944.9      3.8
 26     238302.8      4.0

--Junchao Zhang
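For reference, a minimal OpenMP STREAM-Triad sketch of the kind of kernel being measured here (this is not the PETSc streams source; the array size, initialization, and timing scheme are assumptions). The arrays are sized well above the 120 MB L3 so the kernel streams from DRAM, as in the tests above:

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 80000000 /* 3 arrays x 8 bytes x 80e6 ~ 1.9 GB, far larger than the 120 MB L3 */

    int main(void)
    {
      double *a = malloc(N * sizeof(double));
      double *b = malloc(N * sizeof(double));
      double *c = malloc(N * sizeof(double));
      double scalar = 3.0, t;
      long   i;

      /* First-touch initialization in parallel, so pages land near the threads that use them */
      #pragma omp parallel for
      for (i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

      t = omp_get_wtime();
      #pragma omp parallel for
      for (i = 0; i < N; i++) a[i] = b[i] + scalar * c[i]; /* Triad: streams all three arrays */
      t = omp_get_wtime() - t;

      /* 3 arrays x 8 bytes x N moved by the kernel */
      printf("Triad rate: %.1f MB/s\n", 3.0 * sizeof(double) * N / t / 1e6);
      free(a); free(b); free(c);
      return 0;
    }

How the threads are bound matters for the reasons discussed above; for example, OMP_PROC_BIND=spread places threads on different core pairs rather than packing them onto shared L2s.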
"The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, and an interconnection system that connects all components within the chip at 7 TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 embedded DRAM (eDRAM)." --Junchao Zhang On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev <petsc-dev@mcs.anl.gov<mailto:petsc-dev@mcs.anl.gov>> wrote: L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF output from a Summit compute node to see an illustration of the node layout. --Richard On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote: I also did OpenMP stream test and then I found mismatch between OpenMPI and MPI. That reminded me a subtle issue on summit: pair of cores share L2 cache. One has to place MPI ranks to different pairs to get best bandwidth. See different bindings https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each node has 21 cores. I assume that means 11 pairs. The new results are below. They match with we what I got from OpenMPI. The bandwidth is almost doubled from 1 to 2 cores per socket. IBM document also says each socket has two memory controllers. I could not find the core-memory controller affinity info. I tried different bindings but did not find huge difference. #Ranks Rate (MB/s) Ratio over 2 ranks 1 29229.8 - 2 59091.0 1.0 4 112260.7 1.9 6 159852.8 2.7 8 194351.7 3.3 10 215841.0 3.7 12 232316.6 3.9 14 244615.7 4.1 16 254450.8 4.3 18 262185.7 4.4 20 267181.0 4.5 22 270290.4 4.6 24 221944.9 3.8 26 238302.8 4.0 --Junchao Zhang On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. <bsm...@mcs.anl.gov<mailto:bsm...@mcs.anl.gov>> wrote: Junchao, For completeness could you please run with a single core? But leave the ratio as you have with over 2 ranks since that is the correct model. Thanks Barry > On Sep 22, 2019, at 11:14 AM, Zhang, Junchao > <jczh...@mcs.anl.gov<mailto:jczh...@mcs.anl.gov>> wrote: > > I did stream test on Summit. I used the MPI version from petsc, but largely > increased the array size N since one socket of Summit has 120MB L3 cache. I > used MPI version since it was easy for me to distribute ranks evenly to the > two sockets. > The result matches with data released by OLCF (see attached figure) and data > given by Jed. We can see the bandwidth saturates around 24 ranks. > > #Ranks Rate (MB/s) Ratio over 2 ranks > ------------------------------------------ > 2 59012.2834 1.00 > 4 70959.1475 1.20 > 6 106639.9837 1.81 > 8 138638.6929 2.35 > 10 171125.0873 2.90 > 12 196162.5197 3.32 > 14 215272.7810 3.65 > 16 229562.4040 3.89 > 18 242587.4913 4.11 > 20 251057.1731 4.25 > 22 258569.7794 4.38 > 24 265443.2924 4.50 > 26 266562.7872 4.52 > 28 267043.6367 4.53 > 30 266833.7212 4.52 > 32 267183.8474 4.53 > > On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. > <bsm...@mcs.anl.gov<mailto:bsm...@mcs.anl.gov>> wrote: > > Junchao could try the PETSc (and non-PETSc) streams tests on the machine. > > There are a few differences, compiler, the reported results are with > OpenMP, different number of cores but yes the performance is a bit low. For > DOE that is great, makes GPUs look better :-) > > > > On Sep 21, 2019, at 11:11 PM, Jed Brown > > <j...@jedbrown.org<mailto:j...@jedbrown.org>> wrote: > > > > For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165 > > GB/s for the node for the best case (42 ranks). 
>> "Zhang, Junchao via petsc-dev" <petsc-dev@mcs.anl.gov> writes:
>>
>>> 42 cores have better performance.
>>>
>>> 36 MPI ranks
>>> MatMult         100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 0.0e+00  6 99 97 28  0 100100100100  0  25145  0  0 0.00e+00  0 0.00e+00  0
>>> VecScatterBegin 100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 0.0e+00  0  0 97 28  0   1  0100100  0      0  0  0 0.00e+00  0 0.00e+00  0
>>> VecScatterEnd   100 1.0 7.9205e-01 52.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  22  0  0  0  0      0  0  0 0.00e+00  0 0.00e+00  0
>>>
>>> --Junchao Zhang
>>>
>>> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>>
>>>   Junchao,
>>>
>>>   Mark has a good point; could you also try, for completeness, the CPU with 36 cores and see if it is any better than the 42-core case?
>>>
>>>   Barry
>>>
>>>   So, extrapolating, about 20 nodes of the CPUs are equivalent to 1 node of the GPUs for the multiply for this problem size.
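As a rough check of that extrapolation, using the MatMult rates reported elsewhere in this thread: the 6-rank + 6-GPU CUDA-aware run below reaches 509496 Mflop/s, versus 27493 Mflop/s for the best CPU run (42 ranks); 509496 / 27493 ≈ 18.5, i.e. roughly 20 CPU nodes per GPU node for this multiply.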
>>>> On Sep 21, 2019, at 6:40 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>
>>>> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty saturated at that point.
>>>>
>>>> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>>>> Here are CPU version results on one node with 24 cores and with 42 cores. Click the links for the core layout.
>>>>
>>>> 24 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
>>>> MatMult         100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0  17948  0  0 0.00e+00  0 0.00e+00  0
>>>> VecScatterBegin 100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0      0  0  0 0.00e+00  0 0.00e+00  0
>>>> VecScatterEnd   100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0      0  0  0 0.00e+00  0 0.00e+00  0
>>>>
>>>> 42 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
>>>> MatMult         100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0  27493  0  0 0.00e+00  0 0.00e+00  0
>>>> VecScatterBegin 100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0      0  0  0 0.00e+00  0 0.00e+00  0
>>>> VecScatterEnd   100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0      0  0  0 0.00e+00  0 0.00e+00  0
>>>>
>>>> --Junchao Zhang
>>>>
>>>> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>>>
>>>>   Junchao,
>>>>
>>>>     Very interesting. For completeness, please also run 24 and 42 CPUs without the GPUs. Note that the default layout for the CPU cores is not good; you will want 3 cores on each socket, then 12 on each.
>>>>
>>>>    Thanks
>>>>
>>>>      Barry
>>>>
>>>>   Since Tim is one of our reviewers next week, this is a very good test matrix :-)
>>>>
>>>>> On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>>>>>
>>>>> Click the links to visualize it.
>>>>>
>>>>> 6 ranks
>>>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
>>>>> jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>>>>>
>>>>> 24 ranks
>>>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
>>>>> jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>>>>>
>>>>> --Junchao Zhang
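For context, a minimal sketch of the shape such a driver might take (ex900 itself does not appear in this thread, so its structure here is an assumption; PETSc error checking with CHKERRQ is omitted for brevity). The -mat_type aijcusparse and -vec_type cuda options from the jsrun command lines above are picked up from the options database:

    #include <petscmat.h>

    int main(int argc, char **argv)
    {
      Mat         A;
      Vec         x, y;
      PetscViewer viewer;
      char        file[PETSC_MAX_PATH_LEN];
      PetscInt    i, n = 100;

      PetscInitialize(&argc, &argv, NULL, NULL);
      PetscOptionsGetString(NULL, NULL, "-f", file, sizeof(file), NULL);
      PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL);

      /* Load the matrix from a PETSc binary file; MatSetFromOptions lets
         -mat_type aijcusparse take effect before MatLoad. */
      PetscViewerBinaryOpen(PETSC_COMM_WORLD, file, FILE_MODE_READ, &viewer);
      MatCreate(PETSC_COMM_WORLD, &A);
      MatSetFromOptions(A);
      MatLoad(A, viewer);
      PetscViewerDestroy(&viewer);

      /* Vectors compatible with A (CUDA vectors for an aijcusparse matrix) */
      MatCreateVecs(A, &x, &y);
      VecSetRandom(x, NULL);

      for (i = 0; i < n; i++) MatMult(A, x, y); /* the kernel that -log_view times */

      VecDestroy(&x);
      VecDestroy(&y);
      MatDestroy(&A);
      PetscFinalize();
      return 0;
    }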
>>>>> On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>>>>> Junchao,
>>>>>
>>>>> Can you share your 'jsrun' command so that we can see how you are mapping things to resource sets?
>>>>>
>>>>> --Richard
>>>>>
>>>>> On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
>>>>>> I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I found MatMult was almost dominated by VecScatter in this simple test. Using 6 MPI ranks + 6 GPUs, I found CUDA-aware SF could improve performance. But when I enabled the Multi-Process Service on Summit and used 24 ranks + 6 GPUs, I found CUDA-aware SF hurt performance. I don't know why and will have to profile it. I will also collect data with multiple nodes. Are the matrix and tests proper?
>>>>>>
>>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>>> Event           Count     Time (sec)    Flop                                --- Global ---   --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu -  GPU
>>>>>>                Max Ratio  Max    Ratio  Max  Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R   %T %F %M %L %R Mflop/s Mflop/s  Count   Size   Count   Size  %F
>>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>>> 6 MPI ranks (CPU version)
>>>>>> MatMult         100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0   4743       0      0 0.00e+00    0 0.00e+00   0
>>>>>> VecScatterBegin 100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00   0
>>>>>> VecScatterEnd   100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
>>>>>>
>>>>>> 6 MPI ranks + 6 GPUs + regular SF
>>>>>> MatMult         100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057 3084009    100 1.02e+02  100 2.69e+02 100
>>>>>> VecScatterBegin 100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  64  0100100  0      0       0      0 0.00e+00  100 2.69e+02   0
>>>>>> VecScatterEnd   100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  22  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
>>>>>> VecCUDACopyTo   100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0      0       0    100 1.02e+02    0 0.00e+00   0
>>>>>> VecCopyFromSome 100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  54  0  0  0  0      0       0      0 0.00e+00  100 2.69e+02   0
>>>>>>
>>>>>> 6 MPI ranks + 6 GPUs + CUDA-aware SF
>>>>>> MatMult         100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 509496 3133521      0 0.00e+00    0 0.00e+00 100
>>>>>> VecScatterBegin 100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  1  0 97 18  0  70  0100100  0      0       0      0 0.00e+00    0 0.00e+00   0
>>>>>> VecScatterEnd   100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  17  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
>>>>>>
>>>>>> 24 MPI ranks + 6 GPUs + regular SF
>>>>>> MatMult         100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 510337  951558    100 4.61e+01  100 6.72e+01 100
>>>>>> VecScatterBegin 100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  34  0100100  0      0       0      0 0.00e+00  100 6.72e+01   0
>>>>>> VecScatterEnd   100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  42  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
>>>>>> VecCUDACopyTo   100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0      0       0    100 4.61e+01    0 0.00e+00   0
>>>>>> VecCopyFromSome 100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  29  0  0  0  0      0       0      0 0.00e+00  100 6.72e+01   0
>>>>>>
>>>>>> 24 MPI ranks + 6 GPUs + CUDA-aware SF
>>>>>> MatMult         100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 387864  973391      0 0.00e+00    0 0.00e+00 100
>>>>>> VecScatterBegin 100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  1  0 97 25  0  35  0100100  0      0       0      0 0.00e+00    0 0.00e+00   0
>>>>>> VecScatterEnd   100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  48  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
>>>>>>
>>>>>> --Junchao Zhang
>
> <SummitNode.png>