Thanks

> On Sep 21, 2019, at 10:17 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
> 
> 42 cores give better performance than 36.
> 
> 36 MPI ranks
> MatMult              100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 0.0e+00  6 99 97 28  0 100100100100  0 25145       0      0 0.00e+00    0 0.00e+00  0
> VecScatterBegin      100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 0.0e+00  0  0 97 28  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecScatterEnd        100 1.0 7.9205e-01 52.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  22  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
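> 
> (That is 2.2435 s vs 2.0519 s for 100 MatMults, i.e. 25145 vs 27493 Mflop/s, so 42 cores are roughly 9% faster than 36.)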
> 
> --Junchao Zhang
> 
> 
> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> 
>   Junchao,
> 
>     Mark has a good point; for completeness, could you also try the CPU with 36 cores and see if it is any better than the 42-core case?
> 
>   Barry
> 
>   So, extrapolating, about 20 CPU nodes are equivalent to 1 GPU node for the multiply at this problem size.
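> 
>   (Rough arithmetic behind that, from the logs below: the 24-rank + 6-GPU MatMult runs at about 510,000 Mflop/s on one node versus about 27,500 Mflop/s for 42 CPU cores on one node, and 510/27.5 is roughly 19, i.e. about 20 CPU nodes per GPU node.)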
> 
> > On Sep 21, 2019, at 6:40 PM, Mark Adams <mfad...@lbl.gov> wrote:
> > 
> > I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty 
> > saturated at that point.
> > 
> > On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev 
> > <petsc-dev@mcs.anl.gov> wrote:
> > Here are CPU-only results on one node with 24 cores and with 42 cores. Click the links to see the core layouts.
> > 
> > 24 MPI ranks, 
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> > MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0 17948       0      0 0.00e+00    0 0.00e+00  0
> > VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> > VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> > 
> > 42 MPI ranks, 
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
> > MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0 27493       0      0 0.00e+00    0 0.00e+00  0
> > VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> > VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> > 
> > --Junchao Zhang
> > 
> > 
> > On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> > 
> >   Junchao,
> > 
> >    Very interesting. For completeness, please also run the 24- and 42-rank cases on the CPUs without the GPUs. Note that the default layout for CPU cores is not good. You will want 3 cores on each socket, then 12 on each.
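> > 
> >    For example, for the 24-rank CPU-only case, a jsrun line along the lines of
> > 
> >    jsrun -n 2 -a 12 -c 12 -g 0 -r 2 --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -n 100 -log_view
> > 
> >    should give 12 ranks on each socket (untested guess; it assumes jsrun puts one 12-core resource set on each socket).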
> > 
> >   Thanks
> > 
> >    Barry
> > 
> >   Since Tim is one of our reviewers next week, this is a very good test matrix :-)
> > 
> > 
> > > On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev 
> > > <petsc-dev@mcs.anl.gov> wrote:
> > > 
> > > Click the links to visualize the layouts.
> > > 
> > > 6 ranks
> > > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
> > > jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> > > 
> > > 24 ranks
> > > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> > > jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
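> > > 
> > > # Reading the jsrun flags (my understanding of jsrun on Summit): -n = number of
> > > # resource sets, -a = MPI tasks per resource set, -c = CPU cores per set,
> > > # -g = GPUs per set, -r = resource sets per host. So the second command is
> > > # 6 sets x 4 tasks = 24 ranks sharing 6 GPUs, i.e. 4 ranks per GPU, which is
> > > # why MPS was enabled for that case.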
> > > 
> > > --Junchao Zhang
> > > 
> > > 
> > > On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev 
> > > <petsc-dev@mcs.anl.gov> wrote:
> > > Junchao,
> > > 
> > > Can you share your 'jsrun' command so that we can see how you are mapping 
> > > things to resource sets?
> > > 
> > > --Richard
> > > 
> > > On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
> > >> I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix
> > >> Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100
> > >> times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I
> > >> found MatMult was largely dominated by VecScatter in this simple test.
> > >> Using 6 MPI ranks + 6 GPUs, I found CUDA-aware SF could improve
> > >> performance. But if I enabled Multi-Process Service on Summit and used
> > >> 24 ranks + 6 GPUs, I found CUDA-aware SF hurt performance. I don't know
> > >> why yet and need to profile it. I will also collect data on multiple
> > >> nodes. Are the matrix and the tests appropriate?
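> > >> 
> > >> A minimal sketch of the test driver (not the exact ex900 source; it assumes
> > >> the matrix file given with -f is in PETSc binary format and it omits logging
> > >> stages and other details) is:
> > >> 
> > >> #include <petscmat.h>
> > >> 
> > >> int main(int argc,char **argv)
> > >> {
> > >>   Mat            A;
> > >>   Vec            x,y;
> > >>   PetscViewer    fd;
> > >>   char           file[PETSC_MAX_PATH_LEN] = "HV15R.aij";
> > >>   PetscInt       i,n = 100;
> > >>   PetscErrorCode ierr;
> > >> 
> > >>   ierr = PetscInitialize(&argc,&argv,NULL,NULL);if (ierr) return ierr;
> > >>   ierr = PetscOptionsGetString(NULL,NULL,"-f",file,sizeof(file),NULL);CHKERRQ(ierr);
> > >>   ierr = PetscOptionsGetInt(NULL,NULL,"-n",&n,NULL);CHKERRQ(ierr);
> > >> 
> > >>   /* Load the matrix; -mat_type aijcusparse is picked up by MatSetFromOptions() */
> > >>   ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD,file,FILE_MODE_READ,&fd);CHKERRQ(ierr);
> > >>   ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
> > >>   ierr = MatSetFromOptions(A);CHKERRQ(ierr);
> > >>   ierr = MatLoad(A,fd);CHKERRQ(ierr);
> > >>   ierr = PetscViewerDestroy(&fd);CHKERRQ(ierr);
> > >> 
> > >>   /* Vectors compatible with A; for aijcusparse these come back as CUDA vectors */
> > >>   ierr = MatCreateVecs(A,&x,&y);CHKERRQ(ierr);
> > >>   ierr = VecSet(x,1.0);CHKERRQ(ierr);
> > >> 
> > >>   /* The MatMult/VecScatter numbers below come from -log_view timings of this loop */
> > >>   for (i=0; i<n; i++) {ierr = MatMult(A,x,y);CHKERRQ(ierr);}
> > >> 
> > >>   ierr = VecDestroy(&x);CHKERRQ(ierr);
> > >>   ierr = VecDestroy(&y);CHKERRQ(ierr);
> > >>   ierr = MatDestroy(&A);CHKERRQ(ierr);
> > >>   ierr = PetscFinalize();
> > >>   return ierr;
> > >> }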
> > >> 
> > >> ------------------------------------------------------------------------------------------------------------------------
> > >> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
> > >>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> > >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> > >> 6 MPI ranks (CPU version)
> > >> MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00    0 0.00e+00  0
> > >> VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> > >> VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> > >> 
> > >> 6 MPI ranks + 6 GPUs + regular SF
> > >> MatMult              100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02  100 2.69e+02 100
> > >> VecScatterBegin      100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  64  0100100  0     0       0      0 0.00e+00  100 2.69e+02  0
> > >> VecScatterEnd        100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  22  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> > >> VecCUDACopyTo        100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0     0       0    100 1.02e+02    0 0.00e+00  0
> > >> VecCopyFromSome      100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  54  0  0  0  0     0       0      0 0.00e+00  100 2.69e+02  0
> > >> 
> > >> 6 MPI ranks + 6 GPUs + CUDA-aware SF
> > >> MatMult              100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 0.00e+00    0 0.00e+00 100
> > >> VecScatterBegin      100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  1  0 97 18  0  70  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> > >> VecScatterEnd        100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  17  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> > >> 
> > >> 24 MPI ranks + 6 GPUs + regular SF
> > >> MatMult              100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 4.61e+01  100 6.72e+01 100
> > >> VecScatterBegin      100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  34  0100100  0     0       0      0 0.00e+00  100 6.72e+01  0
> > >> VecScatterEnd        100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  42  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> > >> VecCUDACopyTo        100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0    100 4.61e+01    0 0.00e+00  0
> > >> VecCopyFromSome      100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  29  0  0  0  0     0       0      0 0.00e+00  100 6.72e+01  0
> > >> 
> > >> 24 MPI ranks + 6 GPUs + CUDA-aware SF
> > >> MatMult              100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 387864   973391    0 0.00e+00    0 0.00e+00 100
> > >> VecScatterBegin      100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  1  0 97 25  0  35  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> > >> VecScatterEnd        100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  48  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
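> > >> 
> > >> To summarize the MatMult times above: with 6 ranks, CUDA-aware SF is faster than the regular SF (1.1112e-01 s vs 1.7800e-01 s, about 1.6x); with 24 ranks + MPS it is slower (1.4597e-01 s vs 1.1094e-01 s, about 1.3x), which is the slowdown I need to profile.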
> > >> 
> > >> 
> > >> --Junchao Zhang
> > > 
> > 
> 
