OK, I wrote to the OLCF Consultants and they told me that

* Yes, the jsrun Visualizer numberings correspond to the 'lstopo' ones.

and, from this, I can conclude that

* If I ask for 6 resource sets, each with 1 core and 1 GPU, then some of the cores in different resource sets will share L2/L3 cache.

* For the above case, in which I want 6 MPI ranks that don't share anything, I need to ask for 6 resource sets, each with *2 cores* and 1 GPU. When I ask for 2 cores, each resource set will consist of 2 cores that share L2/L3, so this is how you can get resource sets that don't share L2/L3 between them.

--Richard
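(In jsrun terms, that layout could presumably be requested with something like the line below. This is an untested sketch assembled from the flags already used later in this thread; in particular, '--bind packed:2' as the binding that keeps each rank on its own SMC pair is an assumption, not something confirmed by OLCF, and './app' is a placeholder for the actual executable.)

jsrun -n 6 -a 1 -c 2 -g 1 -r 6 --launch_distribution packed --bind packed:2 ./app

That is, 6 resource sets on the node, each with 1 MPI task, 2 cores, and 1 GPU, so no two ranks share an L2/L3.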
On 9/23/19 11:10 AM, Mills, Richard Tran wrote:

To further muddy the waters, the OLCF Summit User Guide (https://www.olcf.ornl.gov/for-users/system-user-guides/summit/summit-user-guide) states that "The POWER9 processor is built around IBM's SIMD Multi-Core (SMC). The processor provides 22 SMCs with separate 32kB L1 data and instruction caches. Pairs of SMCs share a 512kB L2 cache and a 10MB L3 cache."

And there is some funny stuff in that lstopo output. On the first socket, I see one "SMC" that doesn't share L2/L3 with anyone. This may be because it actually shares them with a "service" core that is hidden from jsrun. But why are there three such SMCs on the second socket?!

I've written to the OLCF Consultants to see if they can provide any clarification on this. In particular, I want to know if the jsrun Visualizer hardware thread and core numberings correspond to the lstopo ones. I think that's the only way to tell whether we are getting cores that don't share L2/L3 resources or not.

--Richard

On 9/23/19 10:58 AM, Zhang, Junchao wrote:

The figure did not clearly say all cores share L3. Instead, we should look at p. 16 of https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf: "The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, and an interconnection system that connects all components within the chip at 7 TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 embedded DRAM (eDRAM)."

--Junchao Zhang

On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev <petsc-dev@mcs.anl.gov> wrote:

L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF output from a Summit compute node for an illustration of the node layout.

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:

I also did an OpenMP STREAM test and found a mismatch between the OpenMP and MPI results. That reminded me of a subtle issue on Summit: each pair of cores shares an L2 cache. One has to place MPI ranks on different pairs to get the best bandwidth. See the different bindings at https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each socket has 21 cores; I assume that means 11 pairs. The new results are below. They match what I got from OpenMP. The bandwidth is almost doubled going from 1 to 2 cores per socket. The IBM documentation also says each socket has two memory controllers. I could not find the core-to-memory-controller affinity info. I tried different bindings but did not find a huge difference.

#Ranks  Rate (MB/s)  Ratio over 2 ranks
---------------------------------------
 1       29229.8        -
 2       59091.0       1.0
 4      112260.7       1.9
 6      159852.8       2.7
 8      194351.7       3.3
10      215841.0       3.7
12      232316.6       3.9
14      244615.7       4.1
16      254450.8       4.3
18      262185.7       4.4
20      267181.0       4.5
22      270290.4       4.6
24      221944.9       3.8
26      238302.8       4.0

--Junchao Zhang
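(For reference, the kernel behind these "Rate (MB/s)" numbers is the STREAM triad. Below is a minimal OpenMP sketch of it, not the PETSc or OLCF benchmark source; the array length N is an arbitrary placeholder that only needs to be much larger than the 120 MB L3 discussed above. Compile with OpenMP enabled, e.g. -fopenmp, and control placement with OMP_NUM_THREADS and the jsrun bindings being compared here.)

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 80000000  /* ~1.9 GB of doubles across three arrays; far larger than the 120 MB L3 */

int main(void)
{
  double *a = malloc(N * sizeof(double));
  double *b = malloc(N * sizeof(double));
  double *c = malloc(N * sizeof(double));
  double scalar = 3.0;

  /* first-touch initialization so pages land near the threads that use them */
  #pragma omp parallel for
  for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

  double t = omp_get_wtime();
  #pragma omp parallel for
  for (long i = 0; i < N; i++) a[i] = b[i] + scalar * c[i];  /* the triad kernel */
  t = omp_get_wtime() - t;

  /* three 8-byte doubles move per iteration: two reads plus one write */
  printf("Triad rate: %.1f MB/s\n", 3.0 * N * sizeof(double) / t / 1e6);

  free(a); free(b); free(c);
  return 0;
}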
On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:

Junchao,

For completeness could you please run with a single core? But leave the ratio as you have it, over 2 ranks, since that is the correct model.

Thanks

Barry

> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>
> I did the STREAM test on Summit. I used the MPI version from PETSc, but greatly increased the array size N, since one socket of Summit has 120 MB of L3 cache. I used the MPI version since it was easy for me to distribute ranks evenly to the two sockets.
> The result matches the data released by OLCF (see attached figure) and the data given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks  Rate (MB/s)   Ratio over 2 ranks
> ------------------------------------------
>  2       59012.2834       1.00
>  4       70959.1475       1.20
>  6      106639.9837       1.81
>  8      138638.6929       2.35
> 10      171125.0873       2.90
> 12      196162.5197       3.32
> 14      215272.7810       3.65
> 16      229562.4040       3.89
> 18      242587.4913       4.11
> 20      251057.1731       4.25
> 22      258569.7794       4.38
> 24      265443.2924       4.50
> 26      266562.7872       4.52
> 28      267043.6367       4.53
> 30      266833.7212       4.52
> 32      267183.8474       4.53
>
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>
> Junchao could try the PETSc (and non-PETSc) streams tests on the machine.
>
> There are a few differences (compiler, the reported results are with OpenMP, different number of cores), but yes, the performance is a bit low. For DOE that is great, makes GPUs look better :-)
>
> > On Sep 21, 2019, at 11:11 PM, Jed Brown <j...@jedbrown.org> wrote:
> >
> > For an AIJ matrix with 32-bit integers, this is 1 flop/6 bytes, or 165 GB/s for the node for the best case (42 ranks).
> >
> > My understanding is that these systems have 8 channels of DDR4-2666 per socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket system, and 270 GB/s STREAM Triad according to this post:
> >
> > https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> >
> > Is this 60% of Triad the best we can get for SpMV?
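(To spell out the arithmetic behind those two numbers: for each nonzero, an AIJ (CSR) MatMult streams an 8-byte matrix value plus a 4-byte column index and performs one multiply and one add, i.e. 12 bytes for 2 flops, or 6 bytes per flop, ignoring row pointers and vector traffic. The best CPU MatMult rate reported further down in this thread, 27493 Mflop/s with 42 ranks, therefore corresponds to roughly 27493e6 flop/s x 6 bytes/flop, about 165 GB/s, and 165/270 is the roughly 60% of STREAM Triad that Jed is asking about.)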
> > "Zhang, Junchao via petsc-dev" <petsc-dev@mcs.anl.gov> writes:
> >
> >> 42 cores have better performance.
> >>
> >> 36 MPI ranks
> >> MatMult          100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 0.0e+00  6 99 97 28  0 100100100100  0 25145       0      0 0.00e+00    0 0.00e+00  0
> >> VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 0.0e+00  0  0 97 28  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> >> VecScatterEnd    100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  22  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>
> >> --Junchao Zhang
> >>
> >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> >>
> >> Junchao,
> >>
> >> Mark has a good point; could you also try for completeness the CPU with 36 cores and see if it is any better than the 42 core case?
> >>
> >> Barry
> >>
> >> So extrapolating about 20 nodes of the CPUs is equivalent to 1 node of the GPUs for the multiply for this problem size.
> >>
> >>> On Sep 21, 2019, at 6:40 PM, Mark Adams <mfad...@lbl.gov> wrote:
> >>>
> >>> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty saturated at that point.
> >>>
> >>> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
> >>> Here are CPU version results on one node with 24 cores, 42 cores. Click the links for core layout.
> >>>
> >>> 24 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> >>> MatMult          100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0 17948       0      0 0.00e+00    0 0.00e+00  0
> >>> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>> VecScatterEnd    100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>>
> >>> 42 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
> >>> MatMult          100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0 27493       0      0 0.00e+00    0 0.00e+00  0
> >>> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>> VecScatterEnd    100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>>
> >>> --Junchao Zhang
> >>>
> >>> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> >>>
> >>> Junchao,
> >>>
> >>> Very interesting. For completeness please run also 24 and 42 CPUs without the GPUs. Note that the default layout for CPU cores is not good. You will want 3 cores on each socket then 12 on each.
> >>>
> >>> Thanks
> >>>
> >>> Barry
> >>>
> >>> Since Tim is one of our reviewers next week this is a very good test matrix :-)
> >>>
> >>>> On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
> >>>>
> >>>> Click the links to visualize it.
> >>>>
> >>>> 6 ranks
> >>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
> >>>> jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >>>>
> >>>> 24 ranks
> >>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> >>>> jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >>>>
> >>>> --Junchao Zhang
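(ex900 does not appear to be one of the stock PETSc examples. For anyone trying to reproduce the experiment, a minimal driver doing what the command line above implies might look like the sketch below; error checking with CHKERRQ is omitted for brevity, and whether the actual test creates its vectors with MatCreateVecs, as here, or via VecSetFromOptions to honor -vec_type cuda is an assumption.)

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  Vec         x, y;
  PetscViewer viewer;
  char        file[PETSC_MAX_PATH_LEN];
  PetscInt    i, n = 100;

  PetscInitialize(&argc, &argv, NULL, NULL);
  PetscOptionsGetString(NULL, NULL, "-f", file, sizeof(file), NULL); /* -f HV15R.aij */
  PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL);                    /* -n 100 repetitions */

  PetscViewerBinaryOpen(PETSC_COMM_WORLD, file, FILE_MODE_READ, &viewer);
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetFromOptions(A);                       /* picks up -mat_type aijcusparse */
  MatLoad(A, viewer);
  PetscViewerDestroy(&viewer);

  MatCreateVecs(A, &x, &y);                   /* vectors compatible with the matrix type */
  VecSet(x, 1.0);
  for (i = 0; i < n; i++) MatMult(A, x, y);   /* the kernel timed by -log_view */

  VecDestroy(&x); VecDestroy(&y); MatDestroy(&A);
  PetscFinalize();
  return 0;
}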
> >>>> On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
> >>>> Junchao,
> >>>>
> >>>> Can you share your 'jsrun' command so that we can see how you are mapping things to resource sets?
> >>>>
> >>>> --Richard
> >>>>
> >>>> On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
> >>>>> I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I found MatMult was largely dominated by VecScatter in this simple test. Using 6 MPI ranks + 6 GPUs, I found the CUDA-aware SF could improve performance. But if I enabled the Multi-Process Service on Summit and used 24 ranks + 6 GPUs, I found the CUDA-aware SF hurt performance. I don't know why and will have to profile it. I will also collect data with multiple nodes. Are the matrix and tests proper?
> >>>>>
> >>>>> ------------------------------------------------------------------------------------------------------------------------
> >>>>> Event                Count      Time (sec)     Flop                            --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
> >>>>>                        Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> >>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>>>> 6 MPI ranks (CPU version)
> >>>>> MatMult          100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0   4743       0      0 0.00e+00    0 0.00e+00  0
> >>>>> VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
> >>>>> VecScatterEnd    100 1.0 2.9441e+00133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
> >>>>>
> >>>>> 6 MPI ranks + 6 GPUs + regular SF
> >>>>> MatMult          100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057 3084009    100 1.02e+02  100 2.69e+02 100
> >>>>> VecScatterBegin  100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  64  0100100  0      0       0      0 0.00e+00  100 2.69e+02  0
> >>>>> VecScatterEnd    100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  22  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
> >>>>> VecCUDACopyTo    100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0      0       0    100 1.02e+02    0 0.00e+00  0
> >>>>> VecCopyFromSome  100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  54  0  0  0  0      0       0      0 0.00e+00  100 2.69e+02  0
> >>>>>
> >>>>> 6 MPI ranks + 6 GPUs + CUDA-aware SF
> >>>>> MatMult          100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 509496 3133521      0 0.00e+00    0 0.00e+00 100
> >>>>> VecScatterBegin  100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  1  0 97 18  0  70  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
> >>>>> VecScatterEnd    100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  17  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
> >>>>>
> >>>>> 24 MPI ranks + 6 GPUs + regular SF
> >>>>> MatMult          100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 510337  951558    100 4.61e+01  100 6.72e+01 100
> >>>>> VecScatterBegin  100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  34  0100100  0      0       0      0 0.00e+00  100 6.72e+01  0
> >>>>> VecScatterEnd    100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  42  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
> >>>>> VecCUDACopyTo    100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0      0       0    100 4.61e+01    0 0.00e+00  0
> >>>>> VecCopyFromSome  100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  29  0  0  0  0      0       0      0 0.00e+00  100 6.72e+01  0
> >>>>>
> >>>>> 24 MPI ranks + 6 GPUs + CUDA-aware SF
> >>>>> MatMult          100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 387864  973391      0 0.00e+00    0 0.00e+00 100
> >>>>> VecScatterBegin  100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  1  0 97 25  0  35  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
> >>>>> VecScatterEnd    100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  48  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
> >>>>>
> >>>>> --Junchao Zhang
>
> <SummitNode.png>
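(As a footnote on what separates the "regular SF" and "CUDA-aware SF" numbers above: with a CUDA-aware MPI the scatter can hand device buffers directly to MPI, while the regular path must stage them through host memory first, which is what the VecCUDACopyTo / VecCopyFromSome entries in the regular-SF logs record. The toy program below illustrates only that difference; it is a sketch, not PETSc's actual VecScatter/SF code, and the buffer size and use of MPI_Sendrecv are arbitrary choices for illustration.)

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Run with exactly 2 MPI ranks; pass any command-line argument to hand
   device pointers directly to MPI (requires a CUDA-aware MPI build). */
int main(int argc, char **argv)
{
  int rank, n = 1 << 20, cuda_aware = (argc > 1);
  double *d_send, *d_recv, *h_send, *h_recv;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  cudaMalloc((void **)&d_send, n * sizeof(double));
  cudaMalloc((void **)&d_recv, n * sizeof(double));
  cudaMemset(d_send, 0, n * sizeof(double));
  h_send = (double *)malloc(n * sizeof(double));
  h_recv = (double *)malloc(n * sizeof(double));

  int peer = rank ^ 1; /* ranks 0 and 1 exchange with each other */
  if (cuda_aware) {
    /* device buffers handed straight to MPI; no host staging copies */
    MPI_Sendrecv(d_send, n, MPI_DOUBLE, peer, 0,
                 d_recv, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  } else {
    /* stage down to the host, communicate, stage back up to the device */
    cudaMemcpy(h_send, d_send, n * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Sendrecv(h_send, n, MPI_DOUBLE, peer, 0,
                 h_recv, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_recv, h_recv, n * sizeof(double), cudaMemcpyHostToDevice);
  }

  free(h_send); free(h_recv);
  cudaFree(d_send); cudaFree(d_recv);
  MPI_Finalize();
  return 0;
}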