On Sat, Sep 21, 2019 at 12:48 AM Smith, Barry F. via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>
>   Junchao,
>
>    Very interesting. For completeness please run also 24 and 42 CPUs
> without the GPUs. Note that the default layout for CPU cores is not good.
> You will want 3 cores on each socket then 12 on each.
>

His params are balanced; see:
https://jsrunvisualizer.olcf.ornl.gov/?s1f0o01n6c4g1r14d1b21l0=

>
>   Thanks
>
>    Barry
>
>   Since Tim is one of our reviewers next week this is a very good test
> matrix :-)
>
> > On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
> >
> > Click the links to visualize it.
> >
> > 6 ranks
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
> > jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >
> > 24 ranks
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> > jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >
> > --Junchao Zhang
> >
> >
> > On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
> > Junchao,
> >
> > Can you share your 'jsrun' command so that we can see how you are mapping things to resource sets?
> >
> > --Richard
> >
> > On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
> >> I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix Collection. Its size is
> >> about 2M x 2M. Then I ran the same MatMult 100 times on one node of Summit with
> >> -mat_type aijcusparse -vec_type cuda. I found MatMult was almost dominated by VecScatter
> >> in this simple test. Using 6 MPI ranks + 6 GPUs, I found CUDA-aware SF could improve
> >> performance. But if I enabled the Multi-Process Service on Summit and used 24 ranks + 6 GPUs,
> >> I found CUDA-aware SF hurt performance. I don't know why and have to profile it. I will also
> >> collect data with multiple nodes. Are the matrix and tests proper?
> >>
> >> ------------------------------------------------------------------------------------------------------------------------
> >> Event                Count      Time (sec)      Flop                                --- Global ---   --- Stage ----   Total     GPU    - CpuToGpu -   - GpuToCpu -  GPU
> >>                    Max Ratio  Max        Ratio  Max      Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s  Mflop/s Count   Size   Count   Size   %F
> >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> >> 6 MPI ranks (CPU version)
> >> MatMult              100 1.0 1.1895e+01 1.0  9.63e+09 1.1  2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0    4743        0      0 0.00e+00    0 0.00e+00   0
> >> VecScatterBegin      100 1.0 4.9145e-02 3.0  0.00e+00 0.0  2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0       0        0      0 0.00e+00    0 0.00e+00   0
> >> VecScatterEnd        100 1.0 2.9441e+00 133  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0       0        0      0 0.00e+00    0 0.00e+00   0
> >>
> >> 6 MPI ranks + 6 GPUs + regular SF
> >> MatMult              100 1.0 1.7800e-01 1.0  9.66e+09 1.1  2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0  318057  3084009    100 1.02e+02  100 2.69e+02 100
> >> VecScatterBegin      100 1.0 1.2786e-01 1.3  0.00e+00 0.0  2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  64  0100100  0       0        0      0 0.00e+00  100 2.69e+02   0
> >> VecScatterEnd        100 1.0 6.2196e-02 3.0  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  22  0  0  0  0       0        0      0 0.00e+00    0 0.00e+00   0
> >> VecCUDACopyTo        100 1.0 1.0850e-02 2.3  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0       0        0    100 1.02e+02    0 0.00e+00   0
> >> VecCopyFromSome      100 1.0 1.0263e-01 1.2  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  54  0  0  0  0       0        0      0 0.00e+00  100 2.69e+02   0
> >>
> >> 6 MPI ranks + 6 GPUs + CUDA-aware SF
> >> MatMult              100 1.0 1.1112e-01 1.0  9.66e+09 1.1  2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0  509496  3133521      0 0.00e+00    0 0.00e+00 100
> >> VecScatterBegin      100 1.0 7.9461e-02 1.1  0.00e+00 0.0  2.8e+03 2.2e+05 0.0e+00  1  0 97 18  0  70  0100100  0       0        0      0 0.00e+00    0 0.00e+00   0
> >> VecScatterEnd        100 1.0 2.2805e-02 1.5  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  17  0  0  0  0       0        0      0 0.00e+00    0 0.00e+00   0
> >>
> >> 24 MPI ranks + 6 GPUs + regular SF
> >> MatMult              100 1.0 1.1094e-01 1.0  2.63e+09 1.2  1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0  510337   951558    100 4.61e+01  100 6.72e+01 100
> >> VecScatterBegin      100 1.0 4.8966e-02 1.8  0.00e+00 0.0  1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  34  0100100  0       0        0      0 0.00e+00  100 6.72e+01   0
> >> VecScatterEnd        100 1.0 7.2969e-02 4.9  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  42  0  0  0  0       0        0      0 0.00e+00    0 0.00e+00   0
> >> VecCUDACopyTo        100 1.0 4.4487e-03 1.8  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0       0        0    100 4.61e+01    0 0.00e+00   0
> >> VecCopyFromSome      100 1.0 4.3315e-02 1.9  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  29  0  0  0  0       0        0      0 0.00e+00  100 6.72e+01   0
> >>
> >> 24 MPI ranks + 6 GPUs + CUDA-aware SF
> >> MatMult              100 1.0 1.4597e-01 1.2  2.63e+09 1.2  1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0  387864   973391      0 0.00e+00    0 0.00e+00 100
> >> VecScatterBegin      100 1.0 6.4899e-02 2.9  0.00e+00 0.0  1.9e+04 5.9e+04 0.0e+00  1  0 97 25  0  35  0100100  0       0        0      0 0.00e+00    0 0.00e+00   0
> >> VecScatterEnd        100 1.0 1.1179e-01 4.1  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  48  0  0  0  0       0        0      0 0.00e+00    0 0.00e+00   0
> >>
> >> --Junchao Zhang
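
P.S. For anyone trying to reproduce these numbers: the ex900 driver itself is not posted in the thread, but from the options on the jsrun lines it is presumably a small PETSc program that loads the binary matrix and repeats MatMult under -log_view. A minimal sketch of such a driver (my reconstruction, not Junchao's actual code; the option names -f and -n are taken from the jsrun lines above) could look like:

/* Hypothetical reconstruction of an ex900-style benchmark: load a matrix in
   PETSc binary format and time repeated MatMult under -log_view. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, y;
  PetscViewer    viewer;
  char           file[PETSC_MAX_PATH_LEN];
  PetscInt       i, n = 100;
  PetscBool      flg;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = PetscOptionsGetString(NULL, NULL, "-f", file, sizeof(file), &flg); CHKERRQ(ierr);
  if (!flg) SETERRQ(PETSC_COMM_WORLD, PETSC_ERR_USER, "Indicate a binary matrix file with -f");
  ierr = PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL); CHKERRQ(ierr);

  /* Load the matrix; -mat_type aijcusparse is picked up by MatSetFromOptions() */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, file, FILE_MODE_READ, &viewer); CHKERRQ(ierr);
  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  ierr = MatSetFromOptions(A); CHKERRQ(ierr);
  ierr = MatLoad(A, viewer); CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);

  /* Work vectors; -vec_type cuda is honored through VecSetFromOptions() */
  ierr = MatCreateVecs(A, &x, &y); CHKERRQ(ierr);
  ierr = VecSetFromOptions(x); CHKERRQ(ierr);
  ierr = VecSetFromOptions(y); CHKERRQ(ierr);
  ierr = VecSet(x, 1.0); CHKERRQ(ierr);

  /* Repeat MatMult so -log_view reports aggregate MatMult/VecScatter timings */
  for (i = 0; i < n; i++) {
    ierr = MatMult(A, x, y); CHKERRQ(ierr);
  }

  ierr = MatDestroy(&A); CHKERRQ(ierr);
  ierr = VecDestroy(&x); CHKERRQ(ierr);
  ierr = VecDestroy(&y); CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Built against a CUDA-enabled PETSc, this would be run exactly as in the jsrun lines above, e.g. ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view.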