[petsc-dev] MatMult on Summit

2019-09-20 Thread Zhang, Junchao via petsc-dev
I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix Collection.
Its size is about 2M x 2M. Then I ran the same MatMult 100 times on one node of
Summit with -mat_type aijcusparse -vec_type cuda. I found MatMult was largely
dominated by VecScatter in this simple test. Using 6 MPI ranks + 6 GPUs, I found
the CUDA-aware SF could improve performance. But if I enabled the Multi-Process
Service (MPS) on Summit and used 24 ranks + 6 GPUs, I found the CUDA-aware SF
hurt performance. I don't know why yet and will have to profile it. I will also
collect data with multiple nodes. Are the matrix and tests appropriate?


---------------------------------------------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ---   Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------
6 MPI ranks (CPU version)
MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0   4743       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0

6 MPI ranks + 6 GPUs + regular SF
MatMult              100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057 3084009    100 1.02e+02  100 2.69e+02 100
VecScatterBegin      100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  64  0100100  0      0       0      0 0.00e+00  100 2.69e+02  0
VecScatterEnd        100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  22  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
VecCUDACopyTo        100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0      0       0    100 1.02e+02    0 0.00e+00  0
VecCopyFromSome      100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  54  0  0  0  0      0       0      0 0.00e+00  100 2.69e+02  0

6 MPI ranks + 6 GPUs + CUDA-aware SF
MatMult              100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 509496 3133521      0 0.00e+00    0 0.00e+00 100
VecScatterBegin      100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  1  0 97 18  0  70  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  17  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0

24 MPI ranks + 6 GPUs + regular SF
MatMult              100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 510337  951558    100 4.61e+01  100 6.72e+01 100
VecScatterBegin      100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  34  0100100  0      0       0      0 0.00e+00  100 6.72e+01  0
VecScatterEnd        100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  42  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
VecCUDACopyTo        100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0      0       0    100 4.61e+01    0 0.00e+00  0
VecCopyFromSome      100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  29  0  0  0  0      0       0      0 0.00e+00  100 6.72e+01  0

24 MPI ranks + 6 GPUs + CUDA-aware SF
MatMult              100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 387864  973391      0 0.00e+00    0 0.00e+00 100
VecScatterBegin      100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  1  0 97 25  0  35  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  48  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0


--Junchao Zhang


Re: [petsc-dev] MatMult on Summit

2019-09-20 Thread Mills, Richard Tran via petsc-dev
Junchao,

Can you share your 'jsrun' command so that we can see how you are mapping 
things to resource sets?

--Richard




Re: [petsc-dev] MatMult on Summit

2019-09-20 Thread Zhang, Junchao via petsc-dev
Click the links to visualize it.

6 ranks
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution 
packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse 
-vec_type cuda -n 100 -log_view

24 ranks
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution 
packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse 
-vec_type cuda -n 100 -log_view

--Junchao Zhang
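For reference, ex900 is not a stock PETSc example. A minimal driver along these
lines (a hypothetical sketch, with the file name, option names, and loop count
taken from the command line above) roughly mirrors the test: load a binary AIJ
matrix, pick up -mat_type/-vec_type from the options database, and repeat
MatMult:

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, y;
  PetscViewer    fd;
  char           file[PETSC_MAX_PATH_LEN];
  PetscInt       i, n = 100;
  PetscBool      flg;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = PetscOptionsGetString(NULL, NULL, "-f", file, sizeof(file), &flg);CHKERRQ(ierr);
  ierr = PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL);CHKERRQ(ierr);

  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, file, FILE_MODE_READ, &fd);CHKERRQ(ierr);
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);    /* honors -mat_type aijcusparse */
  ierr = MatLoad(A, fd);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&fd);CHKERRQ(ierr);

  ierr = MatCreateVecs(A, &x, &y);CHKERRQ(ierr);
  ierr = VecSetFromOptions(x);CHKERRQ(ierr);    /* honors -vec_type cuda */
  ierr = VecSetFromOptions(y);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);

  for (i = 0; i < n; i++) {                     /* the repeated MatMult being profiled */
    ierr = MatMult(A, x, y);CHKERRQ(ierr);
  }

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

With -log_view on the command line, PetscFinalize() prints the kind of event
tables shown in this thread.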



Re: [petsc-dev] MatMult on Summit

2019-09-20 Thread Smith, Barry F. via petsc-dev


  Junchao,

   Very interesting. For completeness, please also run 24 and 42 CPU ranks 
without the GPUs. Note that the default layout for CPU cores is not good: you 
will want 3 cores on each socket, then 12 on each.

  Thanks

   Barry

  Since Tim is one of our reviewers next week, this is a very good test matrix 
:-)



Re: [petsc-dev] MatMult on Summit

2019-09-20 Thread Zhang, Junchao via petsc-dev
Here are the CPU-only results on one node with 24 cores and 42 cores. Click the 
links for the core layout.

24 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0  17948       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0

42 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0  27493       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0

--Junchao Zhang



Re: [petsc-dev] MatMult on Summit

2019-09-20 Thread Smith, Barry F. via petsc-dev


  Dang, makes the GPUs less impressive :-). 

Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Smith, Barry F. via petsc-dev


   Hannah, Junchao and Richard,

     The on-GPU flop rates for 24 MPI ranks sharing the 6 GPUs through MPS look 
totally funky: 951558 and 973391 are so much lower than the unvirtualized 
3084009 and 3133521, and yet the total time to solution is similar for the runs.

     Is it possible these are being counted or calculated wrong? If not, what 
does this mean? Please check the code that computes them (I can't imagine it is 
wrong, but ...).

     It would mean the GPUs take about 3x longer to do the multiplies in the 
MPS case, but where is that time coming from in the other numbers? 
Communication time doesn't drop that much.

     I can't present these numbers with this huge inconsistency.

Thanks,

   Barry





Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Zhang, Junchao via petsc-dev
We log GPU time before/after the cusparse calls: 
https://gitlab.com/petsc/petsc/blob/master/src%2Fmat%2Fimpls%2Faij%2Fseq%2Fseqcusparse%2Faijcusparse.cu#L1441
But according to 
https://docs.nvidia.com/cuda/cusparse/index.html#asynchronous-execution, 
cusparse is asynchronous. Does that mean the GPU time is meaningless?
--Junchao Zhang
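A standalone CUDA sketch (an illustration only, not PETSc code, using a toy
kernel instead of the cusparse SpMV) shows why stopping a host-side timer right
after an asynchronous launch mostly measures launch overhead unless we
synchronize first:

/* hypothetical file async_timing.cu; build with: nvcc -O2 async_timing.cu */
#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

__global__ void busy(double *x, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    double v = x[i];
    for (int k = 0; k < 2000; k++) v = v * 1.0000001 + 1e-9;  /* keep the GPU busy */
    x[i] = v;
  }
}

static double now(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
  const int n = 1 << 22;
  double *d;
  cudaMalloc((void **)&d, n * sizeof(double));
  cudaMemset(d, 0, n * sizeof(double));

  double t0 = now();
  busy<<<(n + 255) / 256, 256>>>(d, n);   /* returns immediately; kernel still running */
  double t_async = now() - t0;            /* mostly launch overhead */

  cudaDeviceSynchronize();                /* analogous to waiting for the GPU */
  double t_sync = now() - t0;             /* includes the actual kernel execution */

  printf("without sync: %.6f s   with sync: %.6f s\n", t_async, t_sync);
  cudaFree(d);
  return 0;
}

The same applies to cusparse SpMV: the interval between the call and an
un-synchronized timer stop does not reflect the work actually done on the
device.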



Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Smith, Barry F. via petsc-dev



> On Sep 21, 2019, at 11:00 AM, Zhang, Junchao  wrote:
> 
> We log gpu time before/after cusparse calls. 
> https://gitlab.com/petsc/petsc/blob/master/src%2Fmat%2Fimpls%2Faij%2Fseq%2Fseqcusparse%2Faijcusparse.cu#L1441
> But according to 
> https://docs.nvidia.com/cuda/cusparse/index.html#asynchronous-execution, 
> cusparse is asynchronous. Does that mean the gpu time is meaningless?
> --Junchao Zhang

  Yes, it looks like those numbers from that routine are meaningless; thanks 
for checking.

  The ierr = WaitForGPU();CHKERRCUDA(ierr); is what is turned on for timing and 
would capture all the compute time. Perhaps it could be moved to appropriate 
places in that routine. Of course, when running with -log_view, WaitForGPU() is 
turned on and that may slow things down, so we don't get the best numbers; I 
have no idea if they would be noticeably higher with WaitForGPU() off.

  I don't understand how the streams are used (the if (!cusparsestruct->stream) 
check); it seems some methods use them in some places but not others.

   Barry
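One possible alternative (an assumption, not what PETSc currently does) is to
time the work with CUDA events recorded on the same stream the cusparse calls
use, so the wait covers only that stream's queued work instead of the whole
device:

#include <cuda_runtime.h>

/* Times whatever 'enqueue' queues on 'stream'; returns milliseconds. */
static float time_on_stream(cudaStream_t stream, void (*enqueue)(cudaStream_t))
{
  cudaEvent_t start, stop;
  float       ms = 0.0f;

  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start, stream);
  enqueue(stream);               /* e.g. the asynchronous cusparse SpMV launched on 'stream' */
  cudaEventRecord(stop, stream);
  cudaEventSynchronize(stop);    /* blocks only until this stream reaches 'stop' */
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

Whether that would be cheaper than a full WaitForGPU() under -log_view would
still have to be measured.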


Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Smith, Barry F. via petsc-dev


  Sorry, forgot:

  Could you please put the GPU wait call before each of the log ends in 
that routine and see what kind of new numbers you get?

   Thanks

 Barry



Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Zhang, Junchao via petsc-dev
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code 
swapped the first two lines. Since with -log_view, MatMultAdd_SeqAIJCUSPARSE is 
blocking, I changed the order to have better overlap.
  ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we also know the execution time 
without -log_view (and hence without the extra CUDA synchronization it turns 
on). I manually calculated the Total Mflop/s for these cases for easy 
comparison.
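Item 3) could look roughly like the following in the driver (a hypothetical
sketch; the function name, the warm-up call, and the barriers are assumptions,
not the actual test code):

#include <petscmat.h>
#include <petsctime.h>

PetscErrorCode TimeMatMult(Mat A, Vec x, Vec y, PetscInt nreps)
{
  PetscErrorCode ierr;
  PetscLogDouble t0, t1;
  PetscInt       i;

  PetscFunctionBegin;
  ierr = MatMult(A, x, y);CHKERRQ(ierr);               /* warm-up: first call pays setup/copy costs */
  ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
  ierr = PetscTime(&t0);CHKERRQ(ierr);
  for (i = 0; i < nreps; i++) {
    ierr = MatMult(A, x, y);CHKERRQ(ierr);
  }
  ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);  /* make the timing independent of rank skew */
  ierr = PetscTime(&t1);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "MatMult: %d reps in %g s\n", (int)nreps, (double)(t1 - t0));CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

This gives a wall-clock number that does not depend on -log_view and its extra
synchronization.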



---------------------------------------------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ---   Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------
6 MPI ranks
MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0   4743       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0

24 MPI ranks
MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0  17948       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0

42 MPI ranks
MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0  27493       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0

6 MPI ranks + 6 GPUs + regular SF + log_view
MatMult              100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 335743  629278    100 1.02e+02  100 2.69e+02 100
VecScatterBegin      100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  24  0100100  0      0       0      0 0.00e+00  100 2.69e+02  0
VecScatterEnd        100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  20  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
VecCUDACopyTo        100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   4  0  0  0  0      0       0    100 1.02e+02    0 0.00e+00  0
VecCopyFromSome      100 1.0 2.8485e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  14  0  0  0  0      0       0      0 0.00e+00  100 2.69e+02  0

6 MPI ranks + 6 GPUs + regular SF + no log_view
MatMult              100 1.0 1.4180e-01                                                          399268

6 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
MatMult              100 1.0 1.1053e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 512224  642075      0 0.00e+00    0 0.00e+00 100
VecScatterBegin      100 1.0 8.3418e-03 1.5 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   6  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 2.2619e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  16  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0

6 MPI ranks + 6 GPUs + CUDA-aware SF + no log_view
MatMult              100 1.0 9.8344e-02                                                          575717

24 MPI ranks + 6 GPUs + regular SF + log_view
MatMult              100 1.0 1.1572e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  0 99 97 25  0 100100100100  0 489223  708601    100 4.61e+01  100 6.72e+01 100

Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Smith, Barry F. via petsc-dev


   Thanks!  This is great stuff, very useful.

   Barry



Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Mark Adams via petsc-dev
On Sat, Sep 21, 2019 at 12:48 AM Smith, Barry F. via petsc-dev 
<petsc-dev@mcs.anl.gov> wrote:

>
>   Junchao,
>
>Very interesting. For completeness please run also 24 and 42 CPUs
> without the GPUs. Note that the default layout for CPU cores is not good.
> You will want 3 cores on each socket then 12 on each.
>

His params are balanced; see:
https://jsrunvisualizer.olcf.ornl.gov/?s1f0o01n6c4g1r14d1b21l0=



Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Mark Adams via petsc-dev
I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty
saturated at that point.

>> >> VecScatterEnd100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00
>> 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0
>> 0.00e+000 0.00e+00  0
>> >>
>> >> 6 MPI ranks + 6 GPUs + regular SF
>> >> MatMult  100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03
>> 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100
>> 1.02e+02  100 2.69e+02 100
>> >> VecScat

Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Smith, Barry F. via petsc-dev


  Junchao,

Mark has a good point; could you also try for completeness the CPU with 36 
cores and see if it is any better than the 42 core case?

  Barry

  So, extrapolating, about 20 CPU nodes are equivalent to 1 GPU node for the 
multiply for this problem size.

> On Sep 21, 2019, at 6:40 PM, Mark Adams  wrote:
> 
> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty 
> saturated at that point.
> 
> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev 
>  wrote:
> Here are CPU version results on one node with 24 cores, 42 cores. Click the 
> links for core layout.
> 
> 24 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
> 0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
> 0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> 
> 42 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
> MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
> 0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
> 0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> 
> --Junchao Zhang
> 
> 
> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F.  wrote:
> 
>   Junchao,
> 
>Very interesting. For completeness please run also 24 and 42 CPUs without 
> the GPUs. Note that the default layout for CPU cores is not good. You will 
> want 3 cores on each socket then 12 on each.
> 
>   Thanks
> 
>Barry
> 
>   Since Tim is one of our reviewers next week this is a very good test matrix 
> :-)
> 
> 
> > On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev 
> >  wrote:
> > 
> > Click the links to visualize it.
> > 
> > 6 ranks
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
> > jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU 
> > --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
> > HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> > 
> > 24 ranks
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> > jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU 
> > --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
> > HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> > 
> > --Junchao Zhang
> > 
> > 
> > On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev 
> >  wrote:
> > Junchao,
> > 
> > Can you share your 'jsrun' command so that we can see how you are mapping 
> > things to resource sets?
> > 
> > --Richard
> > 
> > On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
> >> I downloaded a sparse matrix (HV15R) from Florida Sparse Matrix 
> >> Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 
> >> times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I 
> >> found MatMult was almost dominated by VecScatter in this simple test. 
> >> Using 6 MPI ranks + 6 GPUs,  I found CUDA aware SF could improve 
> >> performance. But if I enabled Multi-Process Service on Summit and used 24 
> >> ranks + 6 GPUs, I found CUDA aware SF hurt performance. I don't know why 
> >> and have to profile it. I will also collect  data with multiple nodes. Are 
> >> the matrix and tests proper?
> >> 
> >> 
> >> EventCount  Time (sec) Flop
> >>   --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - 
> >> GpuToCpu - GPU
> >>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
> >> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   
> >> Count   Size  %F
> >> ---
> >> 6 MPI ranks (CPU version)
> >> MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
> >> 0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
> >> 0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterEnd100 1.0 2.9441e+00133  0.00e+00 0.0 0.0

Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Zhang, Junchao via petsc-dev
42 cores have better performance.

36 MPI ranks
MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

--Junchao Zhang


On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F.  wrote:

  Junchao,

Mark has a good point; could you also try for completeness the CPU with 36 
cores and see if it is any better than the 42 core case?

  Barry

  So extrapolating about 20 nodes of the CPUs is equivalent to 1 node of the 
GPUs for the multiply for this problem size.

> On Sep 21, 2019, at 6:40 PM, Mark Adams 
> mailto:mfad...@lbl.gov>> wrote:
>
> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty 
> saturated at that point.
>
> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>> wrote:
> Here are CPU version results on one node with 24 cores, 42 cores. Click the 
> links for core layout.
>
> 24 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
> 0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
> 0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
>
> 42 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
> MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
> 0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
> 0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
>
> --Junchao Zhang
>
>
> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Junchao,
>
>Very interesting. For completeness please run also 24 and 42 CPUs without 
> the GPUs. Note that the default layout for CPU cores is not good. You will 
> want 3 cores on each socket then 12 on each.
>
>   Thanks
>
>Barry
>
>   Since Tim is one of our reviewers next week this is a very good test matrix 
> :-)
>
>
> > On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev 
> > mailto:petsc-dev@mcs.anl.gov>> wrote:
> >
> > Click the links to visualize it.
> >
> > 6 ranks
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
> > jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU 
> > --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
> > HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >
> > 24 ranks
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> > jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU 
> > --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
> > HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >
> > --Junchao Zhang
> >
> >
> > On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev 
> > mailto:petsc-dev@mcs.anl.gov>> wrote:
> > Junchao,
> >
> > Can you share your 'jsrun' command so that we can see how you are mapping 
> > things to resource sets?
> >
> > --Richard
> >
> > On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
> >> I downloaded a sparse matrix (HV15R) from Florida Sparse Matrix 
> >> Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 
> >> times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I 
> >> found MatMult was almost dominated by VecScatter in this simple test. 
> >> Using 6 MPI ranks + 6 GPUs,  I found CUDA aware SF could improve 
> >> performance. But if I enabled Multi-Process Service on Summit and used 24 
> >> ranks + 6 GPUs, I found CUDA aware SF hurt performance. I don't know why 
> >> and have to profile it. I will also collect  data with multiple nodes. Are 
> >> the matrix and tests proper?
> >>
> >> 
> >> EventCount  Time (sec) Flop
> >>   --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - 
> >> GpuToCpu - GPU
> >>Max Ratio 

Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Smith, Barry F. via petsc-dev
  
  Thanks


> On Sep 21, 2019, at 10:17 PM, Zhang, Junchao  wrote:
> 
> 42 cores have better performance.
> 
> 36 MPI ranks
> MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
> 0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
> 0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> 
> --Junchao Zhang
> 
> 
> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F.  wrote:
> 
>   Junchao,
> 
> Mark has a good point; could you also try for completeness the CPU with 
> 36 cores and see if it is any better than the 42 core case?
> 
>   Barry
> 
>   So extrapolating about 20 nodes of the CPUs is equivalent to 1 node of the 
> GPUs for the multiply for this problem size.
> 
> > On Sep 21, 2019, at 6:40 PM, Mark Adams  wrote:
> > 
> > I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty 
> > saturated at that point.
> > 
> > On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev 
> >  wrote:
> > Here are CPU version results on one node with 24 cores, 42 cores. Click the 
> > links for core layout.
> > 
> > 24 MPI ranks, 
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> > MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
> > 0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
> > 0.00e+00  0
> > VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
> > 0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
> > 0.00e+00  0
> > VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> > 0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
> > 0.00e+00  0
> > 
> > 42 MPI ranks, 
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
> > MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
> > 0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
> > 0.00e+00  0
> > VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
> > 0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
> > 0.00e+00  0
> > VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> > 0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
> > 0.00e+00  0
> > 
> > --Junchao Zhang
> > 
> > 
> > On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F.  wrote:
> > 
> >   Junchao,
> > 
> >Very interesting. For completeness please run also 24 and 42 CPUs 
> > without the GPUs. Note that the default layout for CPU cores is not good. 
> > You will want 3 cores on each socket then 12 on each.
> > 
> >   Thanks
> > 
> >Barry
> > 
> >   Since Tim is one of our reviewers next week this is a very good test 
> > matrix :-)
> > 
> > 
> > > On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev 
> > >  wrote:
> > > 
> > > Click the links to visualize it.
> > > 
> > > 6 ranks
> > > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
> > > jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU 
> > > --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
> > > HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> > > 
> > > 24 ranks
> > > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> > > jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU 
> > > --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
> > > HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> > > 
> > > --Junchao Zhang
> > > 
> > > 
> > > On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev 
> > >  wrote:
> > > Junchao,
> > > 
> > > Can you share your 'jsrun' command so that we can see how you are mapping 
> > > things to resource sets?
> > > 
> > > --Richard
> > > 
> > > On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
> > >> I downloaded a sparse matrix (HV15R) from Florida Sparse Matrix 
> > >> Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 
> > >> times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I 
> > >> found MatMult was almost dominated by VecScatter in this simple test. 
> > >> Using 6 MPI ranks + 6 GPUs,  I found CUDA aware SF could improve 
> > >> performance. But if I enabled Multi-Process Service on Summit and used 
> > >> 24 ranks + 6 GPUs, I found CUDA aware SF hurt performance. I don't know 
> > >> why and have to profile it. I will also collect  data with multiple 
> > >> nodes. Are the matrix and tests proper?
> > >> 
> > >> 
> > >> EventCount  T

Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Karl Rupp via petsc-dev

Hi Junchao,

thanks, these numbers are interesting.

Do you have an easy way to evaluate the benefits of a CUDA-aware MPI vs. 
a non-CUDA-aware MPI that still keeps the benefits of your 
packing/unpacking routines?


I'd like to get a feeling of where the performance gains come from. Is 
it due to the reduced PCI-Express transfer for the scatters (i.e. 
packing/unpacking and transferring only the relevant entries) on each 
rank, or is it some low-level optimization that makes the MPI-part of 
the communication faster? Your current MR includes both; it would be 
helpful to know whether we can extract similar benefits for other GPU 
backends without having to require "CUDA-awareness" of MPI. If the 
benefits are mostly due to the packing/unpacking, we could carry over 
the benefits to other GPU backends (e.g. upcoming Intel GPUs) without 
having to wait for an "Intel-GPU-aware MPI".
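
For concreteness, a minimal sketch of the non-CUDA-aware path being asked 
about: the pack still happens on the GPU (so only the needed entries cross 
the bus), but the packed buffer is staged through host memory before MPI 
sees it. pack_entries is a hypothetical stand-in for the real pack kernel, 
not a PETSc function name.

  #include <mpi.h>
  #include <cuda_runtime.h>

  /* Gather the entries to be sent into a contiguous device buffer. */
  __global__ void pack_entries(const double *x, const int *idx, double *buf, int n)
  {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = x[idx[i]];
  }

  /* Non-CUDA-aware path: pack on the GPU, copy the (small) packed buffer to
     the host, then hand the host buffer to MPI. */
  static void scatter_send_host_staged(const double *d_x, const int *d_idx, int n,
                                       double *d_buf, double *h_buf,
                                       int dest, MPI_Comm comm)
  {
    pack_entries<<<(n + 255) / 256, 256>>>(d_x, d_idx, d_buf, n);
    cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost); /* synchronizes */
    MPI_Send(h_buf, n, MPI_DOUBLE, dest, 0, comm);
  }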


Best regards,
Karli


On 9/21/19 6:22 AM, Zhang, Junchao via petsc-dev wrote:
I downloaded a sparse matrix (HV15R 
) from Florida Sparse Matrix 
Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 
times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I 
found MatMult was almost dominated by VecScatter in this simple test. 
Using 6 MPI ranks + 6 GPUs,  I found CUDA aware SF could improve 
performance. But if I enabled Multi-Process Service on Summit and used 
24 ranks + 6 GPUs, I found CUDA aware SF hurt performance. I don't know 
why and have to profile it. I will also collect  data with multiple 
nodes. Are the matrix and tests proper?



Event                Count      Time (sec)     Flop 
          --- Global ---  --- Stage   Total   GPU    - CpuToGpu -   
- GpuToCpu - GPU
                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen 
  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   
Count   Size  %F

---
6 MPI ranks (CPU version)
MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00   
  0 0.00e+00  0
VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 0.00e+00   
  0 0.00e+00  0
VecScatterEnd        100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00   
  0 0.00e+00  0


6 MPI ranks + 6 GPUs + regular SF
MatMult              100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02 
  100 2.69e+02 100
VecScatterBegin      100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0  64  0100100  0     0       0      0 0.00e+00 
  100 2.69e+02  0
VecScatterEnd        100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  22  0  0  0  0     0       0      0 0.00e+00   
  0 0.00e+00  0
VecCUDACopyTo        100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   5  0  0  0  0     0       0    100 1.02e+02   
  0 0.00e+00  0
VecCopyFromSome      100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  54  0  0  0  0     0       0      0 0.00e+00 
  100 2.69e+02  0


6 MPI ranks + 6 GPUs + CUDA-aware SF
MatMult              100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 0.00e+00   
  0 0.00e+00 100
VecScatterBegin      100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  1  0 97 18  0  70  0100100  0     0       0      0 0.00e+00   
  0 0.00e+00  0
VecScatterEnd        100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  17  0  0  0  0     0       0      0 0.00e+00   
  0 0.00e+00  0


24 MPI ranks + 6 GPUs + regular SF
MatMult              100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 4.61e+01 
  100 6.72e+01 100
VecScatterBegin      100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  0  0 97 25  0  34  0100100  0     0       0      0 0.00e+00 
  100 6.72e+01  0
VecScatterEnd        100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  42  0  0  0  0     0       0      0 0.00e+00   
  0 0.00e+00  0
VecCUDACopyTo        100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0    100 4.61e+01   
  0 0.00e+00  0
VecCopyFromSome      100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  29  0  0  0  0     0       0      0 0.00e+00 
  100 6.72e+01  0


24 MPI ranks + 6 GPUs + CUDA-aware 

Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Jed Brown via petsc-dev
For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
GB/s for the node for the best case (42 ranks).

My understanding is that these systems have 8 channels of DDR4-2666 per
socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
system, and 270 GB/s STREAM Triad according to this post

  
https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/

Is this 60% of Triad the best we can get for SpMV?
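
For reference, a small sketch of the arithmetic behind these numbers; it 
assumes an 8-byte value plus a 4-byte column index per nonzero, 2 flops per 
nonzero, and neglects vector and row-pointer traffic:

  #include <stdio.h>

  int main(void)
  {
    /* AIJ SpMV: 8-byte value + 4-byte column index moved per nonzero,
       2 flops per nonzero => 6 bytes per flop. */
    double bytes_per_flop = (8.0 + 4.0) / 2.0;
    double mflops         = 27493.0;  /* MatMult rate of the 42-rank run above */
    double spmv_gbs       = mflops * 1e6 * bytes_per_flop / 1e9;

    /* 8 channels of DDR4-2666 per socket, 8 bytes per transfer, 2 sockets. */
    double peak_gbs = 2666e6 * 8.0 * 8.0 * 2.0 / 1e9;

    printf("SpMV bandwidth estimate:    %.0f GB/s\n", spmv_gbs);  /* ~165 */
    printf("Theoretical peak:           %.0f GB/s\n", peak_gbs);  /* ~341 */
    printf("Fraction of 270 GB/s Triad: %.0f%%\n", 100.0 * spmv_gbs / 270.0);
    return 0;
  }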

"Zhang, Junchao via petsc-dev"  writes:

> 42 cores have better performance.
>
> 36 MPI ranks
> MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
> 0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
> 0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
>
> --Junchao Zhang
>
>
> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Junchao,
>
> Mark has a good point; could you also try for completeness the CPU with 
> 36 cores and see if it is any better than the 42 core case?
>
>   Barry
>
>   So extrapolating about 20 nodes of the CPUs is equivalent to 1 node of the 
> GPUs for the multiply for this problem size.
>
>> On Sep 21, 2019, at 6:40 PM, Mark Adams 
>> mailto:mfad...@lbl.gov>> wrote:
>>
>> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty 
>> saturated at that point.
>>
>> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev 
>> mailto:petsc-dev@mcs.anl.gov>> wrote:
>> Here are CPU version results on one node with 24 cores, 42 cores. Click the 
>> links for core layout.
>>
>> 24 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
>> MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
>> 0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
>> 0.00e+00  0
>> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
>> 0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>>
>> 42 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
>> MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
>> 0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
>> 0.00e+00  0
>> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
>> 0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>>
>> --Junchao Zhang
>>
>>
>> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. 
>> mailto:bsm...@mcs.anl.gov>> wrote:
>>
>>   Junchao,
>>
>>Very interesting. For completeness please run also 24 and 42 CPUs without 
>> the GPUs. Note that the default layout for CPU cores is not good. You will 
>> want 3 cores on each socket then 12 on each.
>>
>>   Thanks
>>
>>Barry
>>
>>   Since Tim is one of our reviewers next week this is a very good test 
>> matrix :-)
>>
>>
>> > On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev 
>> > mailto:petsc-dev@mcs.anl.gov>> wrote:
>> >
>> > Click the links to visualize it.
>> >
>> > 6 ranks
>> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
>> > jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU 
>> > --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
>> > HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>> >
>> > 24 ranks
>> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
>> > jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU 
>> > --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
>> > HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>> >
>> > --Junchao Zhang
>> >
>> >
>> > On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev 
>> > mailto:petsc-dev@mcs.anl.gov>> wrote:
>> > Junchao,
>> >
>> > Can you share your 'jsrun' command so that we can see how you are mapping 
>> > things to resource sets?
>> >
>> > --Richard
>> >
>> > On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
>> >> I downloaded a sparse matrix (HV15R) from Florida Sparse Matrix 
>> >> Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 
>> >> times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I 
>> >> found MatMult was almost dominated by VecScatter in this simple test. 
>> >> Us

Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Jed Brown via petsc-dev
Karl Rupp via petsc-dev  writes:

> Hi Junchao,
>
> thanks, these numbers are interesting.
>
> Do you have an easy way to evaluate the benefits of a CUDA-aware MPI vs. 
> a non-CUDA-aware MPI that still keeps the benefits of your 
> packing/unpacking routines?
>
> I'd like to get a feeling of where the performance gains come from. Is 
> it due to the reduced PCI-Express transfer 

It's NVLink, not PCI-express.

I wonder if the single-node latency bugs on AC922 are related to these
weird performance results.

https://docs.google.com/spreadsheets/d/1amFJIbpvs9oJcUc-WntsFHO_C0LE7xFJeor-oElt0LY/edit#gid=0


Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Smith, Barry F. via petsc-dev


  Jed,

  What does latency as a function of message size mean? It is in the plots.



> On Sep 21, 2019, at 11:15 PM, Jed Brown via petsc-dev  
> wrote:
> 
> Karl Rupp via petsc-dev  writes:
> 
>> Hi Junchao,
>> 
>> thanks, these numbers are interesting.
>> 
>> Do you have an easy way to evaluate the benefits of a CUDA-aware MPI vs. 
>> a non-CUDA-aware MPI that still keeps the benefits of your 
>> packing/unpacking routines?
>> 
>> I'd like to get a feeling of where the performance gains come from. Is 
>> it due to the reduced PCI-Express transfer 
> 
> It's NVLink, not PCI-express.
> 
> I wonder if the single-node latency bugs on AC922 are related to these
> weird performance results.
> 
> https://docs.google.com/spreadsheets/d/1amFJIbpvs9oJcUc-WntsFHO_C0LE7xFJeor-oElt0LY/edit#gid=0



Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Smith, Barry F. via petsc-dev


  Junchao could try the PETSc (and non-PETSc) streams tests on the machine. 

  There are a few differences (compiler, the reported results are with OpenMP, 
a different number of cores), but yes, the performance is a bit low. For DOE that 
is great, it makes GPUs look better :-)
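
If I remember correctly, the PETSc version can be run from the top of the 
source tree with something like the following (the target name and the NPMAX 
variable should be double-checked against the makefile):

  cd $PETSC_DIR && make streams NPMAX=42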


> On Sep 21, 2019, at 11:11 PM, Jed Brown  wrote:
> 
> For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
> GB/s for the node for the best case (42 ranks).
> 
> My understanding is that these systems have 8 channels of DDR4-2666 per
> socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> system, and 270 GB/s STREAM Triad according to this post
> 
>  
> https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> 
> Is this 60% of Triad the best we can get for SpMV?
> 
> "Zhang, Junchao via petsc-dev"  writes:
> 
>> 42 cores have better performance.
>> 
>> 36 MPI ranks
>> MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
>> 0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+000 
>> 0.00e+00  0
>> VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
>> 0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> 
>> --Junchao Zhang
>> 
>> 
>> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. 
>> mailto:bsm...@mcs.anl.gov>> wrote:
>> 
>>  Junchao,
>> 
>>Mark has a good point; could you also try for completeness the CPU with 
>> 36 cores and see if it is any better than the 42 core case?
>> 
>>  Barry
>> 
>>  So extrapolating about 20 nodes of the CPUs is equivalent to 1 node of the 
>> GPUs for the multiply for this problem size.
>> 
>>> On Sep 21, 2019, at 6:40 PM, Mark Adams 
>>> mailto:mfad...@lbl.gov>> wrote:
>>> 
>>> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty 
>>> saturated at that point.
>>> 
>>> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev 
>>> mailto:petsc-dev@mcs.anl.gov>> wrote:
>>> Here are CPU version results on one node with 24 cores, 42 cores. Click the 
>>> links for core layout.
>>> 
>>> 24 MPI ranks, 
>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
>>> MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
>>> 0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
>>> 0.00e+00  0
>>> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
>>> 0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
>>> 0.00e+00  0
>>> VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
>>> 0.00e+00  0
>>> 
>>> 42 MPI ranks, 
>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
>>> MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
>>> 0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
>>> 0.00e+00  0
>>> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
>>> 0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
>>> 0.00e+00  0
>>> VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
>>> 0.00e+00  0
>>> 
>>> --Junchao Zhang
>>> 
>>> 
>>> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. 
>>> mailto:bsm...@mcs.anl.gov>> wrote:
>>> 
>>>  Junchao,
>>> 
>>>   Very interesting. For completeness please run also 24 and 42 CPUs without 
>>> the GPUs. Note that the default layout for CPU cores is not good. You will 
>>> want 3 cores on each socket then 12 on each.
>>> 
>>>  Thanks
>>> 
>>>   Barry
>>> 
>>>  Since Tim is one of our reviewers next week this is a very good test 
>>> matrix :-)
>>> 
>>> 
 On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev 
 mailto:petsc-dev@mcs.anl.gov>> wrote:
 
 Click the links to visualize it.
 
 6 ranks
 https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
 jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU 
 --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
 HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
 
 24 ranks
 https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
 jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU 
 --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
 HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
 
 --Junchao Zhang
 
 
 On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev 
 mailto:petsc-dev@mcs.anl.gov>> wrote:
 Junchao,
 
 Can you share your 'jsrun' command s

Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Jed Brown via petsc-dev
"Smith, Barry F."  writes:

>   Jed,
>
>   What does latency as a function of message size mean?   It is in the plots

It's just the wall-clock time to ping-pong a message of that size.  All
the small sizes take the same amount of time (i.e., the latency), then
transition to being network bandwidth limited for large sizes.
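
In case it helps, a minimal sketch of such a ping-pong (run with exactly 2 
ranks; this is not the exact benchmark behind that spreadsheet):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    static char buf[1 << 20];
    for (int n = 1; n <= (1 << 20); n *= 2) {
      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (int it = 0; it < 100; it++) {
        if (rank == 0) {
          MPI_Send(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
          MPI_Recv(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
          MPI_Recv(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          MPI_Send(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
      }
      /* half the round-trip time is the usual per-size "latency" reported */
      if (rank == 0) printf("%8d bytes: %.2f us one-way\n",
                            n, 0.5 * (MPI_Wtime() - t0) / 100.0 * 1e6);
    }
    MPI_Finalize();
    return 0;
  }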

>
>> On Sep 21, 2019, at 11:15 PM, Jed Brown via petsc-dev 
>>  wrote:
>> 
>> Karl Rupp via petsc-dev  writes:
>> 
>>> Hi Junchao,
>>> 
>>> thanks, these numbers are interesting.
>>> 
>>> Do you have an easy way to evaluate the benefits of a CUDA-aware MPI vs. 
>>> a non-CUDA-aware MPI that still keeps the benefits of your 
>>> packing/unpacking routines?
>>> 
>>> I'd like to get a feeling of where the performance gains come from. Is 
>>> it due to the reduced PCI-Express transfer 
>> 
>> It's NVLink, not PCI-express.
>> 
>> I wonder if the single-node latency bugs on AC922 are related to these
>> weird performance results.
>> 
>> https://docs.google.com/spreadsheets/d/1amFJIbpvs9oJcUc-WntsFHO_C0LE7xFJeor-oElt0LY/edit#gid=0


Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Smith, Barry F. via petsc-dev



> On Sep 21, 2019, at 11:43 PM, Jed Brown  wrote:
> 
> "Smith, Barry F."  writes:
> 
>>  Jed,
>> 
>>  What does latency as a function of message size mean?   It is in the plots
> 
> It's just the wall-clock time to ping-pong a message of that size.  All
> the small sizes take the same amount of time (i.e., the latency), then
> transition to being network bandwidth limited for large sizes.

   Thanks, this is fine for the small size. But he has the graph up to size 
100 and the plotted values change for larger sizes; surely for 100 the 
time is a combination of latency and bandwidth? Isn't calling it latency a 
misnomer, or do people use this inconsistent terminology when doing ping-pongs? 


> 
>> 
>>> On Sep 21, 2019, at 11:15 PM, Jed Brown via petsc-dev 
>>>  wrote:
>>> 
>>> Karl Rupp via petsc-dev  writes:
>>> 
 Hi Junchao,
 
 thanks, these numbers are interesting.
 
 Do you have an easy way to evaluate the benefits of a CUDA-aware MPI vs. 
 a non-CUDA-aware MPI that still keeps the benefits of your 
 packing/unpacking routines?
 
 I'd like to get a feeling of where the performance gains come from. Is 
 it due to the reduced PCI-Express transfer 
>>> 
>>> It's NVLink, not PCI-express.
>>> 
>>> I wonder if the single-node latency bugs on AC922 are related to these
>>> weird performance results.
>>> 
>>> https://docs.google.com/spreadsheets/d/1amFJIbpvs9oJcUc-WntsFHO_C0LE7xFJeor-oElt0LY/edit#gid=0



Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Karl Rupp via petsc-dev




On 9/22/19 6:15 AM, Jed Brown wrote:

Karl Rupp via petsc-dev  writes:


Hi Junchao,

thanks, these numbers are interesting.

Do you have an easy way to evaluate the benefits of a CUDA-aware MPI vs.
a non-CUDA-aware MPI that still keeps the benefits of your
packing/unpacking routines?

I'd like to get a feeling of where the performance gains come from. Is
it due to the reduced PCI-Express transfer


It's NVLink, not PCI-express.


Indeed.




I wonder if the single-node latency bugs on AC922 are related to these
weird performance results.

https://docs.google.com/spreadsheets/d/1amFJIbpvs9oJcUc-WntsFHO_C0LE7xFJeor-oElt0LY/edit#gid=0



Thanks for these numbers!
Intra-Node > Inter-Node is indeed weird. I haven't observed such an 
inversion before.


Best regards,
Karli


Re: [petsc-dev] MatMult on Summit

2019-09-22 Thread Zhang, Junchao via petsc-dev



On Sat, Sep 21, 2019 at 11:08 PM Karl Rupp via petsc-dev  wrote:
Hi Junchao,

thanks, these numbers are interesting.

Do you have an easy way to evaluate the benefits of a CUDA-aware MPI vs.
a non-CUDA-aware MPI that still keeps the benefits of your
packing/unpacking routines?

I'd like to get a feeling of where the performance gains come from. Is
it due to the reduced PCI-Express transfer for the scatters (i.e.
packing/unpacking and transferring only the relevant entries) on each
rank, or is it some low-level optimization that makes the MPI-part of
the communication faster? Your current MR includes both; it would be
helpful to know whether we can extract similar benefits for other GPU
backends without having to require "CUDA-awareness" of MPI. If the
benefits are mostly due to the packing/unpacking, we could carry over
the benefits to other GPU backends (e.g. upcoming Intel GPUs) without
having to wait for an "Intel-GPU-aware MPI".

Your argument is fair. I will add this support later. Besides the performance 
benefit, GPU-aware MPI can simplify the user's code. That is why I think all 
vendors will converge on it.
This post https://devblogs.nvidia.com/introduction-cuda-aware-mpi/ has a detailed 
explanation of CUDA-aware MPI. In short, it avoids CPU involvement and 
redundant memory copies.
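
As a rough sketch of the send side (assuming the underlying MPI is CUDA-aware), 
the only change relative to the host-staged path is that the packed device 
buffer is passed to MPI directly, with no staging copy:

  #include <mpi.h>
  #include <cuda_runtime.h>

  /* CUDA-aware path: hand the packed device buffer straight to MPI. The MPI
     library can then move it GPU-to-GPU (e.g. over NVLink) without a host
     copy or CPU involvement. */
  static void scatter_send_cuda_aware(double *d_buf, int n, int dest, MPI_Comm comm)
  {
    cudaDeviceSynchronize();                       /* ensure the pack kernel finished */
    MPI_Send(d_buf, n, MPI_DOUBLE, dest, 0, comm); /* d_buf is device memory */
  }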

Best regards,
Karli


On 9/21/19 6:22 AM, Zhang, Junchao via petsc-dev wrote:
> I downloaded a sparse matrix (HV15R
> ) from Florida Sparse Matrix
> Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100
> times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I
> found MatMult was almost dominated by VecScatter in this simple test.
> Using 6 MPI ranks + 6 GPUs,  I found CUDA aware SF could improve
> performance. But if I enabled Multi-Process Service on Summit and used
> 24 ranks + 6 GPUs, I found CUDA aware SF hurt performance. I don't know
> why and have to profile it. I will also collect  data with multiple
> nodes. Are the matrix and tests proper?
>
> 
> EventCount  Time (sec) Flop
>   --- Global ---  --- Stage   Total   GPU- CpuToGpu -
> - GpuToCpu - GPU
> Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen
>   Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size
> Count   Size  %F
> ---
> 6 MPI ranks (CPU version)
> MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05
> 0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+00
>   0 0.00e+00  0
> VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05
> 0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+00
>   0 0.00e+00  0
> VecScatterEnd100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+00
>   0 0.00e+00  0
>
> 6 MPI ranks + 6 GPUs + regular SF
> MatMult  100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05
> 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02
>   100 2.69e+02 100
> VecScatterBegin  100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05
> 0.0e+00  0  0 97 18  0  64  0100100  0 0   0  0 0.00e+00
>   100 2.69e+02  0
> VecScatterEnd100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+00
>   0 0.00e+00  0
> VecCUDACopyTo100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   5  0  0  0  0 0   0100 1.02e+02
>   0 0.00e+00  0
> VecCopyFromSome  100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0  54  0  0  0  0 0   0  0 0.00e+00
>   100 2.69e+02  0
>
> 6 MPI ranks + 6 GPUs + CUDA-aware SF
> MatMult  100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05
> 0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 0.00e+00
>   0 0.00e+00 100
> VecScatterBegin  100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05
> 0.0e+00  1  0 97 18  0  70  0100100  0 0   0  0 0.00e+00
>   0 0.00e+00  0
> VecScatterEnd100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0  17  0  0  0  0 0   0  0 0.00e+00
>   0 0.00e+00  0
>
> 24 MPI ranks + 6 GPUs + regular SF
> MatMult  100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04
> 0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 4.61e+01
>   100 6.72e+01 100
> VecScatterBegin  100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04
> 0.0e+00  0  0 97 25  0  34  0100100  0 0   0  0 0.00e+00
>   100 6.72e+01  0
> VecScatterEnd100 1.0 7.2969e-02

Re: [petsc-dev] MatMult on Summit

2019-09-22 Thread Jed Brown via petsc-dev
"Smith, Barry F."  writes:

>> On Sep 21, 2019, at 11:43 PM, Jed Brown  wrote:
>> 
>> "Smith, Barry F."  writes:
>> 
>>>  Jed,
>>> 
>>>  What does latency as a function of message size mean?   It is in the plots
>> 
>> It's just the wall-clock time to ping-pong a message of that size.  All
>> the small sizes take the same amount of time (i.e., the latency), then
>> transition to being network bandwidth limited for large sizes.
>
>Thanks, this is fine for the small size. But he has the graph up to
>size 100 and the plotted values change for larger sizes, surely
>for 100 the time is a combination of latency and bandwidth?
>Isn't calling it latency a misnomer, or do people use this
>inconsistent terminology when doing ping-pongs?

Latency of an operation is just how long from when you initiate it until
it completes.  Latency in a performance model, such as LogP, is additive
with other factors (like bandwidth and compute throughput).
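
As a rough first-order sketch (a generic postal-style model, not the exact one 
behind that spreadsheet), the time to move an n-byte message is

   T(n) ~= alpha + n/beta

where alpha is the latency and beta the asymptotic bandwidth; alpha dominates 
for small n (the flat part of the curve) and n/beta takes over for large n.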


Re: [petsc-dev] MatMult on Summit

2019-09-22 Thread Jed Brown via petsc-dev
Karl Rupp  writes:

>> I wonder if the single-node latency bugs on AC922 are related to these
>> weird performance results.
>> 
>> https://docs.google.com/spreadsheets/d/1amFJIbpvs9oJcUc-WntsFHO_C0LE7xFJeor-oElt0LY/edit#gid=0
>> 
>
> Thanks for these numbers!
> Intra-Node > Inter-Node is indeed weird. I haven't observed such an 
> inversion before.

As far as I know, it's been there since the machines were deployed
despite obviously being a bug.  I know people at LLNL regard it as a
bug, but it has not been their top priority (presumably at least in part
because applications have not clearly expressed the impact of latency
regressions on their science).


Re: [petsc-dev] MatMult on Summit

2019-09-22 Thread Smith, Barry F. via petsc-dev


  Ok, thanks. Then one has to be careful in HPC when using the term, so that each 
time it is used everyone in the conversation knows which meaning it refers to. 


> On Sep 22, 2019, at 8:33 AM, Jed Brown  wrote:
> 
> "Smith, Barry F."  writes:
> 
>>> On Sep 21, 2019, at 11:43 PM, Jed Brown  wrote:
>>> 
>>> "Smith, Barry F."  writes:
>>> 
 Jed,
 
 What does latency as a function of message size mean?   It is in the plots
>>> 
>>> It's just the wall-clock time to ping-pong a message of that size.  All
>>> the small sizes take the same amount of time (i.e., the latency), then
>>> transition to being network bandwidth limited for large sizes.
>> 
>>   Thanks, this is fine for the small size. But he has the graph up to
>>   size 100 and the plotted values change for larger sizes, surely
>>   for 100 the time is a combination of latency and bandwidth?
>>   Isn't calling it latency a misnomer, or do people use this
>>   inconsistent terminology when doing ping-pongs?
> 
> Latency of an operation is just how long from when you initiate it until
> it completes.  Latency in a performance model, such as LogP, is additive
> with other factors (like bandwidth and compute throughput).



Re: [petsc-dev] MatMult on Summit

2019-09-22 Thread Smith, Barry F. via petsc-dev


   I'm guessing it would be very difficult to connect this particular 
performance bug with a decrease in performance for an actual full application, 
since models don't catch this level of detail well (and since you cannot run 
the application without the bug to see the better performance). IBM/Nvidia are 
not going to care about it if it is just an abstract oddity as opposed to a 
clearly demonstrated problem for the use of the machine, especially if the 
machine is an orphan.

> On Sep 22, 2019, at 8:35 AM, Jed Brown via petsc-dev  
> wrote:
> 
> Karl Rupp  writes:
> 
>>> I wonder if the single-node latency bugs on AC922 are related to these
>>> weird performance results.
>>> 
>>> https://docs.google.com/spreadsheets/d/1amFJIbpvs9oJcUc-WntsFHO_C0LE7xFJeor-oElt0LY/edit#gid=0
>>> 
>> 
>> Thanks for these numbers!
>> Intra-Node > Inter-Node is indeed weird. I haven't observed such an 
>> inversion before.
> 
> As far as I know, it's been there since the machines were deployed
> despite obviously being a bug.  I know people at LLNL regard it as a
> bug, but it has not been their top priority (presumably at least in part
> because applications have not clearly expressed the impact of latency
> regressions on their science).



Re: [petsc-dev] MatMult on Summit

2019-09-22 Thread Jed Brown via petsc-dev
Run two resource sets on one side versus separate nodes.

On Sep 22, 2019 08:46, "Smith, Barry F."  wrote:
   I'm guessing it would be very difficult to connect this particular performance bug with a decrease in performance for an actual full application since models don't catch this level of detail well (and  since you cannot run the application without the bug to see the better performance)?  IBM/Nvidia are not going to care about it if is just an abstract oddity as opposed to clearly demonstrating a problem for the use of the machine, especially if the machine is an orphan.

> On Sep 22, 2019, at 8:35 AM, Jed Brown via petsc-dev  wrote:
> 
> Karl Rupp  writes:
> 
>>> I wonder if the single-node latency bugs on AC922 are related to these
>>> weird performance results.
>>> 
>>> https://docs.google.com/spreadsheets/d/1amFJIbpvs9oJcUc-WntsFHO_C0LE7xFJeor-oElt0LY/edit#gid=0
>>> 
>> 
>> Thanks for these numbers!
>> Intra-Node > Inter-Node is indeed weird. I haven't observed such an 
>> inversion before.
> 
> As far as I know, it's been there since the machines were deployed
> despite obviously being a bug.  I know people at LLNL regard it as a
> bug, but it has not been their top priority (presumably at least in part
> because applications have not clearly expressed the impact of latency
> regressions on their science).




Re: [petsc-dev] MatMult on Summit

2019-09-22 Thread Smith, Barry F. via petsc-dev



> On Sep 22, 2019, at 9:56 AM, Jed Brown  wrote:
> 
> Run two resource sets on one side versus separate nodes.

  I don't know what this is supposed to mean. Is it a toy situation where you 
show the problem is measurable, or a real application run properly at scale 
where you show the problem has an effect? Facilities care about real 
applications at scale losing performance, but toys don't mean that much unless 
you can convince them that the issue actually affects the real application at 
scale as well.  

   This discussion is probably not important so we should drop it. 


> 
> On Sep 22, 2019 08:46, "Smith, Barry F."  wrote:
> 
>I'm guessing it would be very difficult to connect this particular 
> performance bug with a decrease in performance for an actual full application 
> since models don't catch this level of detail well (and  since you cannot run 
> the application without the bug to see the better performance)?  IBM/Nvidia 
> are not going to care about it if is just an abstract oddity as opposed to 
> clearly demonstrating a problem for the use of the machine, especially if the 
> machine is an orphan. 
> 
> > On Sep 22, 2019, at 8:35 AM, Jed Brown via petsc-dev 
> >  wrote: 
> > 
> > Karl Rupp  writes: 
> > 
> >>> I wonder if the single-node latency bugs on AC922 are related to these 
> >>> weird performance results. 
> >>> 
> >>> https://docs.google.com/spreadsheets/d/1amFJIbpvs9oJcUc-WntsFHO_C0LE7xFJeor-oElt0LY/edit#gid=0
> >>>  
> >>> 
> >> 
> >> Thanks for these numbers! 
> >> Intra-Node > Inter-Node is indeed weird. I haven't observed such an 
> >> inversion before. 
> > 
> > As far as I know, it's been there since the machines were deployed 
> > despite obviously being a bug.  I know people at LLNL regard it as a 
> > bug, but it has not been their top priority (presumably at least in part 
> > because applications have not clearly expressed the impact of latency 
> > regressions on their science). 
> 
> 
> 



Re: [petsc-dev] MatMult on Summit

2019-09-22 Thread Smith, Barry F. via petsc-dev

  Here is how the bandwidth improves with more cores. Terrible in going from 1 
to 2 cores per socket.




> On Sep 21, 2019, at 2:03 PM, Zhang, Junchao  wrote:
>
> I made the following changes:
> 1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>   ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>   ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
>   PetscFunctionReturn(0);
> 2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code 
> swapped the first two lines. Since with -log_view, MatMultAdd_SeqAIJCUSPARSE 
> is blocking, I changed the order to have better overlap.
>   ierr = 
> VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>   ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>   ierr = 
> VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>   ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
> 3) Log time directly in the test code so we can also know execution time 
> without -log_view (hence cuda synchronization). I manually calculated the 
> Total Mflop/s for these cases for easy comparison.
>
> <>
>
> 
> EventCount  Time (sec) Flop   
>--- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - 
> GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   
> Size  %F
> ---
> 6 MPI ranks,
> MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
> 0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
> 0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
>
> 24 MPI ranks
> MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
> 0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
> 0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
>
> 42 MPI ranks
> MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
> 0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
> 0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
>
> 6 MPI ranks + 6 GPUs + regular SF + log_view
> MatMult  100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
> 0.0e+00  0 99 97 18  0 100100100100  0 335743   629278  100 1.02e+02  100 
> 2.69e+02 100
> VecScatterBegin  100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 
> 0.0e+00  0  0 97 18  0  24  0100100  0 0   0  0 0.00e+00  100 
> 2.69e+02  0
> VecScatterEnd100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0  20  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecCUDACopyTo100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   4  0  0  0  0 0   0100 1.02e+020 
> 0.00e+00  0
> VecCopyFromSome  100 1.0 2.8485e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0  14  0  0  0  0 0   0  0 0.00e+00  100 
> 2.69e+02  0
>
> 6 MPI ranks + 6 GPUs + regular SF  + No log_view
> MatMult: 100 1.0 1.4180e-01   
>   399268
>
> 6 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
> MatMult  100 1.0 1.1053e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
> 0.0e+00  1 99 97 18  0 100100100100  0 512224   6420750 0.00e+000 
> 0.00e+00 100
> VecScatterBegin  100 1.0 8.3418e-03 1.5 0.00e+00 0.0 2.8e+03 2.2e+05 
> 0.0e+00  0  0 97 18  0   6  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd100 1.0 2.2619e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0  16  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
>
> 6 MPI ranks + 6 GPUs + 

Re: [petsc-dev] MatMult on Summit

2019-09-22 Thread Smith, Barry F. via petsc-dev


  Junchao,

 For completeness could you please run with a single core? But leave the 
ratio relative to 2 ranks, as you have it, since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao  wrote:
> 
> I did stream test on Summit. I used the MPI version from petsc, but largely 
> increased the array size N since one socket of Summit has 120MB L3 cache. I 
> used MPI version since it was easy for me to distribute ranks evenly to the 
> two sockets. 
> The result matches with data released by OLCF (see attached figure) and data 
> given by Jed. We can see the bandwidth saturates around 24 ranks.
> 
> #Ranks   Rate (MB/s)   Ratio over 2 ranks
> ------------------------------------------
>  2         59012.2834        1.00
>  4         70959.1475        1.20
>  6        106639.9837        1.81
>  8        138638.6929        2.35
> 10        171125.0873        2.90
> 12        196162.5197        3.32
> 14        215272.7810        3.65
> 16        229562.4040        3.89
> 18        242587.4913        4.11
> 20        251057.1731        4.25
> 22        258569.7794        4.38
> 24        265443.2924        4.50
> 26        266562.7872        4.52
> 28        267043.6367        4.53
> 30        266833.7212        4.52
> 32        267183.8474        4.53
> 
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F.  wrote:
> 
>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine. 
> 
>   There are a few differences, compiler, the reported results are with 
> OpenMP, different number of cores but yes the performance is a bit low. For 
> DOE that is great, makes GPUs look better :-)
> 
> 
> > On Sep 21, 2019, at 11:11 PM, Jed Brown  wrote:
> > 
> > For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
> > GB/s for the node for the best case (42 ranks).
> > 
> > My understanding is that these systems have 8 channels of DDR4-2666 per
> > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> > system, and 270 GB/s STREAM Triad according to this post
> > 
> >  
> > https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> > 
> > Is this 60% of Triad the best we can get for SpMV?
> > 
> > "Zhang, Junchao via petsc-dev"  writes:
> > 
> >> 42 cores have better performance.
> >> 
> >> 36 MPI ranks
> >> MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
> >> 0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
> >> 0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  1  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >> 
> >> --Junchao Zhang
> >> 
> >> 
> >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. 
> >> mailto:bsm...@mcs.anl.gov>> wrote:
> >> 
> >>  Junchao,
> >> 
> >>Mark has a good point; could you also try for completeness the CPU with 
> >> 36 cores and see if it is any better than the 42 core case?
> >> 
> >>  Barry
> >> 
> >>  So extrapolating about 20 nodes of the CPUs is equivalent to 1 node of 
> >> the GPUs for the multiply for this problem size.
> >> 
> >>> On Sep 21, 2019, at 6:40 PM, Mark Adams 
> >>> mailto:mfad...@lbl.gov>> wrote:
> >>> 
> >>> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty 
> >>> saturated at that point.
> >>> 
> >>> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev 
> >>> mailto:petsc-dev@mcs.anl.gov>> wrote:
> >>> Here are CPU version results on one node with 24 cores, 42 cores. Click 
> >>> the links for core layout.
> >>> 
> >>> 24 MPI ranks, 
> >>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> >>> MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
> >>> 0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
> >>> 0.00e+00  0
> >>> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
> >>> 0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
> >>> 0.00e+00  0
> >>> VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> >>> 0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
> >>> 0.00e+00  0
> >>> 
> >>> 42 MPI ranks, 
> >>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
> >>> MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
> >>> 0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
> >>> 0.00e+00  0
> >>> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
> >>> 0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
> >>> 0.00e+00  0
> >>> VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> >>> 0.0

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
I also did the OpenMP stream test and then found a mismatch between the OpenMP 
and MPI results. That reminded me of a subtle issue on Summit: pairs of cores 
share an L2 cache, so one has to place MPI ranks on different pairs to get the 
best bandwidth. See the different bindings 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
socket has 21 cores available; I assume that means 11 pairs. The new results 
are below. They match what I got from OpenMP. The bandwidth almost doubles from 
1 to 2 cores per socket. The IBM documentation also says each socket has two 
memory controllers, but I could not find the core-to-memory-controller affinity 
info. I tried different bindings and did not find a huge difference.

#Ranks  Rate (MB/s)Ratio over 2 ranks
1 29229.8   -
2 59091.0  1.0
4    112260.7  1.9
6    159852.8  2.7
8    194351.7  3.3
10   215841.0  3.7
12   232316.6  3.9
14   244615.7  4.1
16   254450.8  4.3
18   262185.7  4.4
20   267181.0  4.5
22   270290.4  4.6
24   221944.9  3.8
26   238302.8  4.0
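
For reference, here is a minimal sketch of the kind of MPI triad measurement 
behind the table above. It is NOT the actual PETSc streams benchmark (which I 
believe lives under src/benchmarks/ in the PETSc tree); array size, repetition 
count and the reported rate may differ:

  /* Hedged sketch of an MPI STREAM-triad bandwidth test: each rank runs a
     triad loop and rank 0 reports the aggregate rate in MB/s. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>
  #define N 80000000  /* big enough that the 3 arrays spill the 120 MB L3 */
  int main(int argc, char **argv)
  {
    double *a = malloc(N*sizeof(double)), *b = malloc(N*sizeof(double)), *c = malloc(N*sizeof(double));
    double t, scalar = 3.0;
    int    i, rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }
    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime();
    for (i = 0; i < N; i++) a[i] = b[i] + scalar*c[i];  /* triad: 3 doubles moved per i */
    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime() - t;
    if (!rank) printf("%d ranks: %.1f MB/s\n", size, size*3.0*N*sizeof(double)/t/1.0e6);
    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
  }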


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

 For completeness could you please run with a single core? But leave the 
ratio as you have with over 2 ranks since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I did stream test on Summit. I used the MPI version from petsc, but largely 
> increased the array size N since one socket of Summit has 120MB L3 cache. I 
> used MPI version since it was easy for me to distribute ranks evenly to the 
> two sockets.
> The result matches with data released by OLCF (see attached figure) and data 
> given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks Rate (MB/s) Ratio over 2 ranks
> --
> 2     59012.2834   1.00
> 4     70959.1475   1.20
> 6    106639.9837   1.81
> 8    138638.6929   2.35
> 10   171125.0873   2.90
> 12   196162.5197   3.32
> 14   215272.7810   3.65
> 16   229562.4040   3.89
> 18   242587.4913   4.11
> 20   251057.1731   4.25
> 22   258569.7794   4.38
> 24   265443.2924   4.50
> 26   266562.7872   4.52
> 28   267043.6367   4.53
> 30   266833.7212   4.52
> 32   267183.8474   4.53
>
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine.
>
>   There are a few differences, compiler, the reported results are with 
> OpenMP, different number of cores but yes the performance is a bit low. For 
> DOE that is great, makes GPUs look better :-)
>
>
> > On Sep 21, 2019, at 11:11 PM, Jed Brown 
> > mailto:j...@jedbrown.org>> wrote:
> >
> > For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
> > GB/s for the node for the best case (42 ranks).
> >
> > My understanding is that these systems have 8 channels of DDR4-2666 per
> > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> > system, and 270 GB/s STREAM Triad according to this post
> >
> >  
> > https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> >
> > Is this 60% of Triad the best we can get for SpMV?
> >
> > "Zhang, Junchao via petsc-dev" 
> > mailto:petsc-dev@mcs.anl.gov>> writes:
> >
> >> 42 cores have better performance.
> >>
> >> 36 MPI ranks
> >> MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
> >> 0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
> >> 0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  1  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >>
> >> --Junchao Zhang
> >>
> >>
> >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. 
> >> mailto:bsm...@mcs.anl.gov>>>
> >>  wrote:
> >>
> >>  Junchao,
> >>
> >>Mark has a good point; could you also try for completeness the CPU with 
> >> 36 cores and see if it is any better than the 42 core case?
> >>
> >>  Barry
> >>
> >>  So extrapolating about 20 nodes of the CPUs is equivalent to 1 node of 
> >> the GPUs for the multiply for this problem size.
> >>
> >>> On Sep 21, 2019, at 6:40 PM, Mark Adams 
> >>> mailto:mfad...@lbl.gov>>>
> >>>  wrote:
> >>>
> >>> I came up

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Smith, Barry F. via petsc-dev



  Junchao,

Great, thanks

   Barry

  Eventually I think this should all got into a MR that includes these tests 
and the PetscSF ping-pongs so later someone can reproduce these numbers on 
Summit and on the new machines that come out.

> On Sep 23, 2019, at 11:01 AM, Zhang, Junchao  wrote:
> 
> I also did the OpenMP stream test and then found a mismatch between the OpenMP 
> and MPI results. That reminded me of a subtle issue on Summit: pairs of cores 
> share an L2 cache, so one has to place MPI ranks on different pairs to get the 
> best bandwidth. See the different bindings 
> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
> socket has 21 cores available; I assume that means 11 pairs. The new results 
> are below. They match what I got from OpenMP. The bandwidth almost doubles from 
> 1 to 2 cores per socket. The IBM documentation also says each socket has two 
> memory controllers, but I could not find the core-to-memory-controller affinity 
> info. I tried different bindings and did not find a huge difference.
>   
> #Ranks  Rate (MB/s)Ratio over 2 ranks
> 1 29229.8   -
> 2 59091.0  1.0
> 4    112260.7  1.9
> 6    159852.8  2.7
> 8    194351.7  3.3
> 10   215841.0  3.7
> 12   232316.6  3.9
> 14   244615.7  4.1
> 16   254450.8  4.3
> 18   262185.7  4.4
> 20   267181.0  4.5
> 22   270290.4  4.6
> 24   221944.9  3.8
> 26   238302.8  4.0
> 
> 
> --Junchao Zhang
> 
> 
> On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F.  wrote:
> 
>   Junchao,
> 
>  For completeness could you please run with a single core? But leave the 
> ratio as you have with over 2 ranks since that is the correct model.
> 
>Thanks
> 
>  Barry
> 
> 
> > On Sep 22, 2019, at 11:14 AM, Zhang, Junchao  wrote:
> > 
> > I did stream test on Summit. I used the MPI version from petsc, but largely 
> > increased the array size N since one socket of Summit has 120MB L3 cache. I 
> > used MPI version since it was easy for me to distribute ranks evenly to the 
> > two sockets. 
> > The result matches with data released by OLCF (see attached figure) and 
> > data given by Jed. We can see the bandwidth saturates around 24 ranks.
> > 
> > #Ranks Rate (MB/s) Ratio over 2 ranks
> > --
> > 2     59012.2834   1.00
> > 4     70959.1475   1.20
> > 6    106639.9837   1.81
> > 8    138638.6929   2.35
> > 10   171125.0873   2.90
> > 12   196162.5197   3.32
> > 14   215272.7810   3.65
> > 16   229562.4040   3.89
> > 18   242587.4913   4.11
> > 20   251057.1731   4.25
> > 22   258569.7794   4.38
> > 24   265443.2924   4.50
> > 26   266562.7872   4.52
> > 28   267043.6367   4.53
> > 30   266833.7212   4.52
> > 32   267183.8474   4.53
> > 
> > On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F.  wrote:
> > 
> >   Junchao could try the PETSc (and non-PETSc) streams tests on the machine. 
> > 
> >   There are a few differences, compiler, the reported results are with 
> > OpenMP, different number of cores but yes the performance is a bit low. For 
> > DOE that is great, makes GPUs look better :-)
> > 
> > 
> > > On Sep 21, 2019, at 11:11 PM, Jed Brown  wrote:
> > > 
> > > For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
> > > GB/s for the node for the best case (42 ranks).
> > > 
> > > My understanding is that these systems have 8 channels of DDR4-2666 per
> > > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> > > system, and 270 GB/s STREAM Triad according to this post
> > > 
> > >  
> > > https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> > > 
> > > Is this 60% of Triad the best we can get for SpMV?
> > > 
> > > "Zhang, Junchao via petsc-dev"  writes:
> > > 
> > >> 42 cores have better performance.
> > >> 
> > >> 36 MPI ranks
> > >> MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
> > >> 0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+00
> > >> 0 0.00e+00  0
> > >> VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
> > >> 0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+00
> > >> 0 0.00e+00  0
> > >> VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> > >> 0.0e+00  1  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+00
> > >> 0 0.00e+00  0
> > >> 
> > >> --Junchao Zhang
> > >> 
> > >> 
> > >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. 
> > >> mailto:bsm...@mcs.anl.gov>> wrote:
> > >> 
> > >>  Junchao,
> > >> 
> > >>Mark has a good point; could you also try for completeness the CPU 
> > >> with 36 cores and see if it is any

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF 
output from a Summit compute node to see an illustration of the node layout.

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:
I also did the OpenMP stream test and then found a mismatch between the OpenMP 
and MPI results. That reminded me of a subtle issue on Summit: pairs of cores 
share an L2 cache, so one has to place MPI ranks on different pairs to get the 
best bandwidth. See the different bindings 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
socket has 21 cores available; I assume that means 11 pairs. The new results 
are below. They match what I got from OpenMP. The bandwidth almost doubles from 
1 to 2 cores per socket. The IBM documentation also says each socket has two 
memory controllers, but I could not find the core-to-memory-controller affinity 
info. I tried different bindings and did not find a huge difference.

#Ranks  Rate (MB/s)Ratio over 2 ranks
1 29229.8   -
2 59091.0  1.0
4    112260.7  1.9
6    159852.8  2.7
8    194351.7  3.3
10   215841.0  3.7
12   232316.6  3.9
14   244615.7  4.1
16   254450.8  4.3
18   262185.7  4.4
20   267181.0  4.5
22   270290.4  4.6
24   221944.9  3.8
26   238302.8  4.0


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

 For completeness could you please run with a single core? But leave the 
ratio as you have with over 2 ranks since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I did stream test on Summit. I used the MPI version from petsc, but largely 
> increased the array size N since one socket of Summit has 120MB L3 cache. I 
> used MPI version since it was easy for me to distribute ranks evenly to the 
> two sockets.
> The result matches with data released by OLCF (see attached figure) and data 
> given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks Rate (MB/s) Ratio over 2 ranks
> --
> 2     59012.2834   1.00
> 4     70959.1475   1.20
> 6    106639.9837   1.81
> 8    138638.6929   2.35
> 10   171125.0873   2.90
> 12   196162.5197   3.32
> 14   215272.7810   3.65
> 16   229562.4040   3.89
> 18   242587.4913   4.11
> 20   251057.1731   4.25
> 22   258569.7794   4.38
> 24   265443.2924   4.50
> 26   266562.7872   4.52
> 28   267043.6367   4.53
> 30   266833.7212   4.52
> 32   267183.8474   4.53
>
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine.
>
>   There are a few differences, compiler, the reported results are with 
> OpenMP, different number of cores but yes the performance is a bit low. For 
> DOE that is great, makes GPUs look better :-)
>
>
> > On Sep 21, 2019, at 11:11 PM, Jed Brown 
> > mailto:j...@jedbrown.org>> wrote:
> >
> > For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
> > GB/s for the node for the best case (42 ranks).
> >
> > My understanding is that these systems have 8 channels of DDR4-2666 per
> > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> > system, and 270 GB/s STREAM Triad according to this post
> >
> >  
> > https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> >
> > Is this 60% of Triad the best we can get for SpMV?
> >
> > "Zhang, Junchao via petsc-dev" 
> > mailto:petsc-dev@mcs.anl.gov>> writes:
> >
> >> 42 cores have better performance.
> >>
> >> 36 MPI ranks
> >> MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
> >> 0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
> >> 0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  1  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >>
> >> --Junchao Zhang
> >>
> >>
> >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. 
> >> mailto:bsm...@mcs.anl.gov>>>
> >>  wrote:
> >>
> >>  Junchao,
> >>
> >>Mark has a good point; could you also try for completeness the CPU with 
> >> 36 cores and see if it is any better than the 42 core case?
> >>
> >>  Barry
> >>
> >>  So extrapolating about 20 nodes of the CPUs is equivalent to 1 node of

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
The figure did not clearly say all cores share L3.  Instead, we should look at 
p.16 of https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf

"The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, 
and an interconnection system that connects all components within the chip at 7 
TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 
embedded DRAM (eDRAM)."
--Junchao Zhang


On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF 
output from a Summit compute node to see an illustration of the node layout.

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:
I also did the OpenMP stream test and then found a mismatch between the OpenMP 
and MPI results. That reminded me of a subtle issue on Summit: pairs of cores 
share an L2 cache, so one has to place MPI ranks on different pairs to get the 
best bandwidth. See the different bindings 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
socket has 21 cores available; I assume that means 11 pairs. The new results 
are below. They match what I got from OpenMP. The bandwidth almost doubles from 
1 to 2 cores per socket. The IBM documentation also says each socket has two 
memory controllers, but I could not find the core-to-memory-controller affinity 
info. I tried different bindings and did not find a huge difference.

#Ranks  Rate (MB/s)Ratio over 2 ranks
1 29229.8   -
2 59091.0  1.0
4    112260.7  1.9
6    159852.8  2.7
8    194351.7  3.3
10   215841.0  3.7
12   232316.6  3.9
14   244615.7  4.1
16   254450.8  4.3
18   262185.7  4.4
20   267181.0  4.5
22   270290.4  4.6
24   221944.9  3.8
26   238302.8  4.0


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

 For completeness could you please run with a single core? But leave the 
ratio as you have with over 2 ranks since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I did stream test on Summit. I used the MPI version from petsc, but largely 
> increased the array size N since one socket of Summit has 120MB L3 cache. I 
> used MPI version since it was easy for me to distribute ranks evenly to the 
> two sockets.
> The result matches with data released by OLCF (see attached figure) and data 
> given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks Rate (MB/s) Ratio over 2 ranks
> --
> 2     59012.2834   1.00
> 4     70959.1475   1.20
> 6    106639.9837   1.81
> 8    138638.6929   2.35
> 10   171125.0873   2.90
> 12   196162.5197   3.32
> 14   215272.7810   3.65
> 16   229562.4040   3.89
> 18   242587.4913   4.11
> 20   251057.1731   4.25
> 22   258569.7794   4.38
> 24   265443.2924   4.50
> 26   266562.7872   4.52
> 28   267043.6367   4.53
> 30   266833.7212   4.52
> 32   267183.8474   4.53
>
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine.
>
>   There are a few differences, compiler, the reported results are with 
> OpenMP, different number of cores but yes the performance is a bit low. For 
> DOE that is great, makes GPUs look better :-)
>
>
> > On Sep 21, 2019, at 11:11 PM, Jed Brown 
> > mailto:j...@jedbrown.org>> wrote:
> >
> > For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
> > GB/s for the node for the best case (42 ranks).
> >
> > My understanding is that these systems have 8 channels of DDR4-2666 per
> > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> > system, and 270 GB/s STREAM Triad according to this post
> >
> >  
> > https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> >
> > Is this 60% of Triad the best we can get for SpMV?
> >
> > "Zhang, Junchao via petsc-dev" 
> > mailto:petsc-dev@mcs.anl.gov>> writes:
> >
> >> 42 cores have better performance.
> >>
> >> 36 MPI ranks
> >> MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
> >> 0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
> >> 0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  1  0  0  0  0  

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
To further muddy the waters, the OLCF Summit User Guide 
(https://www.olcf.ornl.gov/for-users/system-user-guides/summit/summit-user-guide)
 states that

"The POWER9 processor is built around IBM’s SIMD Multi-Core (SMC). The 
processor provides 22 SMCs with separate 32kB L1 data and instruction caches. 
Pairs of SMCs share a 512kB L2 cache and a 10MB L3 cache."

And there is some funny stuff in that lstopo output. On the first socket, I see 
one "SMC" that doesn't share L2/L3 with anyone. This may be because it actually 
shares them with a "service" core that is hidden from jsrun. But why are there 
three such SMCs on the second socket?!

I've written to the OLCF Consultants to see if they can provide any 
clarification on this. In particular, I want to know if the jsrun Visualizer 
hardware thread and core numberings correspond to the lstopo ones. I think 
that's the only way to tell if we are getting cores that don't share L2/L3 
resources or not.

--Richard


On 9/23/19 10:58 AM, Zhang, Junchao wrote:
The figure did not clearly say all cores share L3.  Instead, we should look at 
p.16 of https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf

"The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, 
and an interconnection system that connects all components within the chip at 7 
TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 
embedded DRAM (eDRAM)."
--Junchao Zhang


On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF 
output from a Summit compute node to see an illustration of the node layout.

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:
I also did the OpenMP stream test and then found a mismatch between the OpenMP 
and MPI results. That reminded me of a subtle issue on Summit: pairs of cores 
share an L2 cache, so one has to place MPI ranks on different pairs to get the 
best bandwidth. See the different bindings 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
socket has 21 cores available; I assume that means 11 pairs. The new results 
are below. They match what I got from OpenMP. The bandwidth almost doubles from 
1 to 2 cores per socket. The IBM documentation also says each socket has two 
memory controllers, but I could not find the core-to-memory-controller affinity 
info. I tried different bindings and did not find a huge difference.

#Ranks  Rate (MB/s)Ratio over 2 ranks
1 29229.8   -
2 59091.0  1.0
4    112260.7  1.9
6    159852.8  2.7
8    194351.7  3.3
10   215841.0  3.7
12   232316.6  3.9
14   244615.7  4.1
16   254450.8  4.3
18   262185.7  4.4
20   267181.0  4.5
22   270290.4  4.6
24   221944.9  3.8
26   238302.8  4.0


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

 For completeness could you please run with a single core? But leave the 
ratio as you have with over 2 ranks since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I did stream test on Summit. I used the MPI version from petsc, but largely 
> increased the array size N since one socket of Summit has 120MB L3 cache. I 
> used MPI version since it was easy for me to distribute ranks evenly to the 
> two sockets.
> The result matches with data released by OLCF (see attached figure) and data 
> given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks Rate (MB/s) Ratio over 2 ranks
> --
> 2     59012.2834   1.00
> 4     70959.1475   1.20
> 6    106639.9837   1.81
> 8    138638.6929   2.35
> 10   171125.0873   2.90
> 12   196162.5197   3.32
> 14   215272.7810   3.65
> 16   229562.4040   3.89
> 18   242587.4913   4.11
> 20   251057.1731   4.25
> 22   258569.7794   4.38
> 24   265443.2924   4.50
> 26   266562.7872   4.52
> 28   267043.6367   4.53
> 30   266833.7212   4.52
> 32   267183.8474   4.53
>
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine.
>
>   There are a few differences, compiler, the reported results are with 
> OpenMP, different number of cores but yes the performance is a bit low. For 
> DOE that is great, makes GPUs look better :-)
>
>
> > On Sep 21, 2019, at 11:11 PM, Jed Brown 
> > mailto:j...@jedbrown.org>> wrote:
> >
> > For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
> > GB/s for the node for the best case (42 ranks).
> >
>

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
OK, I wrote to the OLCF Consultants and they told me that

* Yes, the jsrun Visualizer numberings correspond to the 'lstopo' ones.

and, from this I can conclude that

* If I ask for 6 resource sets, each with 1 core and 1 GPU, then some of 
the cores in different resource sets will share L2/L3 cache.

* For the above case, in which I want 6 MPI ranks that don't share anything, I 
need to ask for 6 resource sets, each with *2 cores* and 1 GPU. When I ask 
for 2 cores, each resource set will consist of 2 cores that share L2/L3, so 
this is how you can get resource sets that don't share L2/L3 between them (as 
in the example jsrun line below).
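
A concrete (untested) jsrun line for that second case might look like the 
following; the application name is just a placeholder, and the exact flag 
spellings should be checked against the Summit documentation:

  jsrun -n 6 -r 6 -a 1 -c 2 -g 1 -b packed:2 ./my_petsc_app

i.e., 6 resource sets on the node, each with 2 cores and 1 GPU, one MPI rank 
per resource set, and each rank bound to both cores of its pair.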

--Richard

On 9/23/19 11:10 AM, Mills, Richard Tran wrote:
To further muddy the waters, the OLCF Summit User Guide 
(https://www.olcf.ornl.gov/for-users/system-user-guides/summit/summit-user-guide)
 states that

"The POWER9 processor is built around IBM’s SIMD Multi-Core (SMC). The 
processor provides 22 SMCs with separate 32kB L1 data and instruction caches. 
Pairs of SMCs share a 512kB L2 cache and a 10MB L3 cache."

And there is some funny stuff in that lstopo output. On the first socket, I see 
one "SMC" that doesn't share L2/L3 with anyone. This may be because it actually 
shares this with a "service" node that is hidden to jsrun. But why are there 
three such SMCs on the second socket?!

I've written to the OLCF Consultants to see if they can provide any 
clarification on this. In particular, I want to know if the jsrun Visualizer 
hardware thread and core numberings correspond to the lstopo ones. I think 
that's the only way to tell if we are getting cores that don't share L2/L3 
resources or not.

--Richard


On 9/23/19 10:58 AM, Zhang, Junchao wrote:
The figure did not clearly say all cores share L3.  Instead, we should look at 
p.16 of https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf

"The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, 
and an interconnection system that connects all components within the chip at 7 
TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 
embedded DRAM (eDRAM)."
--Junchao Zhang


On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF 
output from a Summit compute node to see an illustration of the node layout.

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:
I also did the OpenMP stream test and then found a mismatch between the OpenMP 
and MPI results. That reminded me of a subtle issue on Summit: pairs of cores 
share an L2 cache, so one has to place MPI ranks on different pairs to get the 
best bandwidth. See the different bindings 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
socket has 21 cores available; I assume that means 11 pairs. The new results 
are below. They match what I got from OpenMP. The bandwidth almost doubles from 
1 to 2 cores per socket. The IBM documentation also says each socket has two 
memory controllers, but I could not find the core-to-memory-controller affinity 
info. I tried different bindings and did not find a huge difference.

#Ranks  Rate (MB/s)Ratio over 2 ranks
1 29229.8   -
2 59091.0  1.0
4    112260.7  1.9
6    159852.8  2.7
8    194351.7  3.3
10   215841.0  3.7
12   232316.6  3.9
14   244615.7  4.1
16   254450.8  4.3
18   262185.7  4.4
20   267181.0  4.5
22   270290.4  4.6
24   221944.9  3.8
26   238302.8  4.0


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

 For completeness could you please run with a single core? But leave the 
ratio as you have with over 2 ranks since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I did stream test on Summit. I used the MPI version from petsc, but largely 
> increased the array size N since one socket of Summit has 120MB L3 cache. I 
> used MPI version since it was easy for me to distribute ranks evenly to the 
> two sockets.
> The result matches with data released by OLCF (see attached figure) and data 
> given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks Rate (MB/s) Ratio over 2 ranks
> --
> 2     59012.2834   1.00
> 4     70959.1475   1.20
> 6    106639.9837   1.81
> 8    138638.6929   2.35
> 10   171125.0873   2.90
> 12   196162.5197   3.32
> 14   215272.7810   3.65
> 16   229562.4040   3.89
> 18   242587.4913   4.11
> 20   251057.1731   4.25
> 22   258569.7794   4.38
> 24   265443.2924   4.50
> 26   266562.7872   4.52
> 28   267043.6367   4.53

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end of the 
function it had

  if (!yy) { /* MatMult */
if (!cusparsestruct->stream) {
  ierr = WaitForGPU();CHKERRCUDA(ierr);
}
  }

I assume we don't need the logic to do this only in the MatMult() (no add) 
case and should just do it all the time, for the purposes of timing if for no 
other reason. Is there some reason NOT to do this because of worries about the 
effects that these WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu, now that I look closer. In 
MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU timing calls 
around the cusparse_csr_spmv() (but no WaitForGPU() inside the timed region). I 
believe this is another area in which we get a meaningless timing. It looks 
like we need a WaitForGPU() there, and then maybe inside the timed region 
handling the scatter. (I don't know if this stuff happens asynchronously or 
not.) But do we potentially want two WaitForGPU() calls in one function, just 
to help with getting timings? I don't have a good idea of how much overhead 
this adds.
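
For concreteness, the pattern being discussed would look something like the 
sketch below. This is not the actual aijcusparse.cu source; cusparse_csr_spmv() 
stands in for whatever asynchronous kernel is being timed, and the error 
checking of its return value is omitted:

  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
  stat = cusparse_csr_spmv(/* matrix and vector arguments */);
  ierr = WaitForGPU();CHKERRCUDA(ierr);   /* make sure the launch has finished */
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);

Without the WaitForGPU(), the end time can be taken while the kernel is still 
running, which is what makes the logged GPU time meaningless.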

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code 
swapped the first two lines. Since with -log_view, MatMultAdd_SeqAIJCUSPARSE is 
blocking, I changed the order to have better overlap.
  ierr = 
VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = 
VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know execution time 
without -log_view (hence cuda synchronization). I manually calculated the Total 
Mflop/s for these cases for easy comparison.

<>


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---
6 MPI ranks,
MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

24 MPI ranks
MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

42 MPI ranks
MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

6 MPI ranks + 6 GPUs + regular SF + log_view
MatMult  100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  0 99 97 18  0 100100100100  0 335743   629278  100 1.02e+02  100 
2.69e+02 100
VecScatterBegin  100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0  24  0100100  0 0   0  0 0.00e+00  100 
2.69e+02  0
VecScatterEnd100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  20  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
VecCUDACopyTo100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   4  0  0  0  0 0   0100 1.02e+020 
0.00e+00  0
VecCopyFromSome  

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
It looks like cusparsestruct->stream is always created (not NULL), so I don't 
understand the logic of the "if (!cusparsestruct->stream)" check.
--Junchao Zhang


On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end of the 
function it had

  if (!yy) { /* MatMult */
if (!cusparsestruct->stream) {
  ierr = WaitForGPU();CHKERRCUDA(ierr);
}
  }

I assume we don't need the logic to do this only in the MatMult() with no add 
case and should just do this all the time, for the purposes of timing if no 
other reason. Is there some reason to NOT do this because of worries the about 
effects that these WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu, now that I 
look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU 
timing calls around the cusparse_csr_spmv() (but no WaitForGPU() inside the 
timed region). I believe this is another area in which we get a meaningless 
timing. It looks like we need a WaitForGPU() there, and then maybe inside the 
timed region handling the scatter. (I don't know if this stuff happens 
asynchronously or not.) But do we potentially want two WaitForGPU() calls in 
one function, just to help with getting timings? I don't have a good idea of 
how much overhead this adds.

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code 
swapped the first two lines. Since with -log_view, MatMultAdd_SeqAIJCUSPARSE is 
blocking, I changed the order to have better overlap.
  ierr = 
VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = 
VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know execution time 
without -log_view (hence cuda synchronization). I manually calculated the Total 
Mflop/s for these cases for easy comparison.

<>


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---
6 MPI ranks,
MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

24 MPI ranks
MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

42 MPI ranks
MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

6 MPI ranks + 6 GPUs + regular SF + log_view
MatMult  100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  0 99 97 18  0 100100100100  0 335743   629278  100 1.02e+02  100 
2.69e+02 100
VecScatterBegin  100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0  24  0100100  0 0   0  0 0.00e+00  100 
2.69e+02  0
VecScatterEnd100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 
0

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mark Adams via petsc-dev
Note, the numerical problems that we have look a lot like a race condition
of some sort. They happen with empty processors and go away under
cuda-memcheck (a valgrind-like tool).

I did try adding WaitForGPU(), but maybe I didn't do it right, or there are
other synchronization mechanisms involved.


On Mon, Sep 23, 2019 at 6:28 PM Zhang, Junchao via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> It looks cusparsestruct->stream is always created (not NULL).  I don't
> know logic of the "if (!cusparsestruct->stream)".
> --Junchao Zhang
>
>
> On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
>
>> In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end
>> of the function it had
>>
>>   if (!yy) { /* MatMult */
>> if (!cusparsestruct->stream) {
>>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>> }
>>   }
>>
>> I assume we don't need the logic to do this only in the MatMult() with no
>> add case and should just do this all the time, for the purposes of timing
>> if no other reason. Is there some reason to NOT do this because of worries
>> the about effects that these WaitForGPU() invocations might have on
>> performance?
>>
>> I notice other problems in aijcusparse.cu, now that I look closer. In
>> MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU timing calls
>> around the cusparse_csr_spmv() (but no WaitForGPU() inside the timed
>> region). I believe this is another area in which we get a meaningless
>> timing. It looks like we need a WaitForGPU() there, and then maybe inside
>> the timed region handling the scatter. (I don't know if this stuff happens
>> asynchronously or not.) But do we potentially want two WaitForGPU() calls
>> in one function, just to help with getting timings? I don't have a good
>> idea of how much overhead this adds.
>>
>> --Richard
>>
>> On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
>>
>> I made the following changes:
>> 1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
>>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>>   ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>>   ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
>>   PetscFunctionReturn(0);
>> 2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old
>> code swapped the first two lines. Since with
>> -log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the order to
>> have better overlap.
>>   ierr =
>> VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>   ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>>   ierr =
>> VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>   ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
>> 3) Log time directly in the test code so we can also know execution
>> time without -log_view (hence cuda synchronization). I manually calculated
>> the Total Mflop/s for these cases for easy comparison.
>>
>> <>
>>
>>
>> 
>> EventCount  Time (sec) Flop
>>--- Global ---  --- Stage   Total   GPU- CpuToGpu -   -
>> GpuToCpu - GPU
>>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen
>>  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size
>> Count   Size  %F
>>
>> ---
>> 6 MPI ranks,
>> MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05
>> 0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05
>> 0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterEnd100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000
>> 0.00e+00  0
>>
>> 24 MPI ranks
>> MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04
>> 0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04
>> 0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000
>> 0.00e+00  0
>>
>> 42 MPI ranks
>> MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04
>> 0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04
>> 0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterEnd100 1.0 8.5184e-0

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
I'm no CUDA expert (not yet, anyway), but, from what I've read, the default 
stream (stream 0) is (mostly) synchronous to host and device, so WaitForGPU() 
is not needed in that case. I don't know if there is any performance penalty in 
explicitly calling it in that case, anyway.

In any case, it looks like there are still some cases where potentially 
asynchronous CUDA library calls are being "timed" without a WaitForGPU() to 
ensure that the calls actually complete. I will make a pass through the 
aijcusparse and aijviennacl code looking for these.

--Richard

On 9/23/19 3:28 PM, Zhang, Junchao wrote:
It looks cusparsestruct->stream is always created (not NULL).  I don't know 
logic of the "if (!cusparsestruct->stream)".
--Junchao Zhang


On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end of the 
function it had

  if (!yy) { /* MatMult */
if (!cusparsestruct->stream) {
  ierr = WaitForGPU();CHKERRCUDA(ierr);
}
  }

I assume we don't need the logic to do this only in the MatMult() with no add 
case and should just do this all the time, for the purposes of timing if no 
other reason. Is there some reason to NOT do this because of worries the about 
effects that these WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu, now that I 
look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU 
timing calls around the cusparse_csr_spmv() (but no WaitForGPU() inside the 
timed region). I believe this is another area in which we get a meaningless 
timing. It looks like we need a WaitForGPU() there, and then maybe inside the 
timed region handling the scatter. (I don't know if this stuff happens 
asynchronously or not.) But do we potentially want two WaitForGPU() calls in 
one function, just to help with getting timings? I don't have a good idea of 
how much overhead this adds.

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code 
swapped the first two lines. Since with -log_view, MatMultAdd_SeqAIJCUSPARSE is 
blocking, I changed the order to have better overlap.
  ierr = 
VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = 
VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know execution time 
without -log_view (hence cuda synchronization). I manually calculated the Total 
Mflop/s for these cases for easy comparison.

<>


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---
6 MPI ranks,
MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

24 MPI ranks
MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

42 MPI ranks
MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
0.00e+

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Karl Rupp via petsc-dev

Hi,

`git grep cudaStreamCreate` reports that vectors, matrices and scatters 
create their own streams. This will almost inevitably create races 
(there is no synchronization mechanism implemented), unless one calls 
WaitForGPU() after each operation. Some of the non-deterministic tests 
can likely be explained by this.
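
As a toy illustration (plain CUDA, not PETSc code; kernel and variable names 
are made up) of the kind of cross-stream race this can cause, and of the 
synchronization that removes it:

  #include <cuda_runtime.h>

  __global__ void produce(double *x, int n)
  { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) x[i] = 2.0*i; }

  __global__ void consume(const double *x, double *y, int n)
  { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) y[i] = x[i] + 1.0; }

  int main(void)
  {
    const int n = 1 << 20;
    double *x, *y;
    cudaStream_t s1, s2;                 /* two objects, two private streams */
    cudaMalloc(&x, n*sizeof(double));
    cudaMalloc(&y, n*sizeof(double));
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    produce<<<(n+255)/256, 256, 0, s1>>>(x, n);
    /* Nothing orders s2 after s1, so consume() could read stale x ...     */
    cudaStreamSynchronize(s1);           /* ... unless we synchronize here  */
    consume<<<(n+255)/256, 256, 0, s2>>>(x, y, n);
    cudaDeviceSynchronize();             /* roughly what WaitForGPU() does  */
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(x); cudaFree(y);
    return 0;
  }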


I'll clean this up in the next few hours if there are no objections.

Best regards,
Karli



On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
I'm no CUDA expert (not yet, anyway), but, from what I've read, the 
default stream (stream 0) is (mostly) synchronous to host and device, so 
WaitForGPU() is not needed in that case. I don't know if there is any 
performance penalty in explicitly calling it in that case, anyway.


In any case, it looks like there are still some cases where potentially 
asynchronous CUDA library calls are being "timed" without a WaitForGPU() 
to ensure that the calls actually complete. I will make a pass through 
the aijcusparse and aijviennacl code looking for these.


--Richard

On 9/23/19 3:28 PM, Zhang, Junchao wrote:
It looks cusparsestruct->stream is always created (not NULL).  I don't 
know logic of the "if (!cusparsestruct->stream)".

--Junchao Zhang


On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:


In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
the end of the function it had

  if (!yy) { /* MatMult */
    if (!cusparsestruct->stream) {
  ierr = WaitForGPU();CHKERRCUDA(ierr);
    }
  }

I assume we don't need the logic to do this only in the MatMult()
with no add case and should just do this all the time, for the
purposes of timing if no other reason. Is there some reason to NOT
do this because of worries the about effects that these
WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu ,
now that I look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I
see that we have GPU timing calls around the cusparse_csr_spmv()
(but no WaitForGPU() inside the timed region). I believe this is
another area in which we get a meaningless timing. It looks like
we need a WaitForGPU() there, and then maybe inside the timed
region handling the scatter. (I don't know if this stuff happens
asynchronously or not.) But do we potentially want two
WaitForGPU() calls in one function, just to help with getting
timings? I don't have a good idea of how much overhead this adds.

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:

I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence.
The old code swapped the first two lines. Since with
-log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the
order to have better overlap.
  ierr =

VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr =

VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know
execution time without -log_view (hence cuda synchronization). I
manually calculated the Total Mflop/s for these cases for easy
comparison.

<>



Event                Count      Time (sec)     Flop  
               --- Global ---  --- Stage   Total   GPU    -

CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess  
AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s

Count   Size   Count   Size  %F

---
6 MPI ranks,
MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03
2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0
 0 0.00e+00    0 0.00e+00  0

VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03
2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0
 0 0.00e+00    0 0.00e+00  0

VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00
0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0
 0 0.00e+00    0 0.00e+00  0


24 MPI ranks
MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04
5.9e+04 0.0e+00

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
No objection. Thanks.
--Junchao Zhang


On Mon, Sep 23, 2019 at 10:09 PM Karl Rupp 
mailto:r...@iue.tuwien.ac.at>> wrote:
Hi,

`git grep cudaStreamCreate` reports that vectors, matrices and scatters
create their own streams. This will almost inevitably create races
(there is no synchronization mechanism implemented), unless one calls
WaitForGPU() after each operation. Some of the non-deterministic tests
can likely be explained by this.

I'll clean this up in the next few hours if there are no objections.

Best regards,
Karli



On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
> I'm no CUDA expert (not yet, anyway), but, from what I've read, the
> default stream (stream 0) is (mostly) synchronous to host and device, so
> WaitForGPU() is not needed in that case. I don't know if there is any
> performance penalty in explicitly calling it in that case, anyway.
>
> In any case, it looks like there are still some cases where potentially
> asynchronous CUDA library calls are being "timed" without a WaitForGPU()
> to ensure that the calls actually complete. I will make a pass through
> the aijcusparse and aijviennacl code looking for these.
>
> --Richard
>
> On 9/23/19 3:28 PM, Zhang, Junchao wrote:
>> It looks cusparsestruct->stream is always created (not NULL).  I don't
>> know logic of the "if (!cusparsestruct->stream)".
>> --Junchao Zhang
>>
>>
>> On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev
>> mailto:petsc-dev@mcs.anl.gov> 
>> >> wrote:
>>
>> In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
>> the end of the function it had
>>
>>   if (!yy) { /* MatMult */
>> if (!cusparsestruct->stream) {
>>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>> }
>>   }
>>
>> I assume we don't need the logic to do this only in the MatMult()
>> with no add case and should just do this all the time, for the
>> purposes of timing if no other reason. Is there some reason to NOT
>> do this because of worries the about effects that these
>> WaitForGPU() invocations might have on performance?
>>
>> I notice other problems in aijcusparse.cu 
>> ,
>> now that I look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I
>> see that we have GPU timing calls around the cusparse_csr_spmv()
>> (but no WaitForGPU() inside the timed region). I believe this is
>> another area in which we get a meaningless timing. It looks like
>> we need a WaitForGPU() there, and then maybe inside the timed
>> region handling the scatter. (I don't know if this stuff happens
>> asynchronously or not.) But do we potentially want two
>> WaitForGPU() calls in one function, just to help with getting
>> timings? I don't have a good idea of how much overhead this adds.
>>
>> --Richard
>>
>> On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
>>> I made the following changes:
>>> 1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
>>>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>>>   ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>>>   ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
>>>   PetscFunctionReturn(0);
>>> 2) In MatMult_MPIAIJCUSPARSE, use the following code sequence.
>>> The old code swapped the first two lines. Since with
>>> -log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the
>>> order to have better overlap.
>>>   ierr =
>>> 
>>> VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>>   ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>>>   ierr =
>>> 
>>> VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>>   ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
>>> 3) Log time directly in the test code so we can also know
>>> execution time without -log_view (hence cuda synchronization). I
>>> manually calculated the Total Mflop/s for these cases for easy
>>> comparison.
>>>
>>> <>
>>>
>>> 
>>> 
>>> EventCount  Time (sec) Flop
>>>--- Global ---  --- Stage   Total   GPU-
>>> CpuToGpu -   - GpuToCpu - GPU
>>>Max Ratio  Max Ratio   Max  Ratio  Mess
>>> AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s
>>> Count   Size   Count   Size  %F
>>> 
>>> ---
>>> 6 MPI ranks,
>>> MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03
>>> 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743   0
>>>  0 0.00e+000 0.00e+00  

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
Karl, that would be fantastic. Much obliged!

--Richard

On 9/23/19 8:09 PM, Karl Rupp wrote:
Hi,

`git grep cudaStreamCreate` reports that vectors, matrices and scatters create 
their own streams. This will almost inevitably create races (there is no 
synchronization mechanism implemented), unless one calls WaitForGPU() after 
each operation. Some of the non-deterministic tests can likely be explained by 
this.

I'll clean this up in the next few hours if there are no objections.

Best regards,
Karli



On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
I'm no CUDA expert (not yet, anyway), but, from what I've read, the default 
stream (stream 0) is (mostly) synchronous to host and device, so WaitForGPU() 
is not needed in that case. I don't know if there is any performance penalty in 
explicitly calling it in that case, anyway.

In any case, it looks like there are still some cases where potentially 
asynchronous CUDA library calls are being "timed" without a WaitForGPU() to 
ensure that the calls actually complete. I will make a pass through the 
aijcusparse and aijviennacl code looking for these.
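
For what it's worth, the "meaningless timing" effect is easy to demonstrate outside of PETSc. A small CUDA sketch (invented kernel, not anything from aijcusparse.cu): the launch returns immediately, so a host timer stopped right after the launch measures only launch overhead, while stopping it after a device synchronization (which is what a WaitForGPU() inside the timed region gives us) measures the kernel itself:

  #include <chrono>
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void spin(double *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
      double s = 0.0;
      for (int k = 0; k < 2000; ++k) s += k * 1e-9;  /* burn some time on the GPU */
      x[i] = s;
    }
  }

  int main() {
    const int n = 1 << 22;
    double *x;
    cudaMalloc(&x, n * sizeof(double));
    using clk = std::chrono::steady_clock;

    auto t0 = clk::now();
    spin<<<(n + 255) / 256, 256>>>(x, n);     /* asynchronous: returns right away */
    auto t1 = clk::now();                     /* a timer stopped here sees only launch overhead */
    cudaDeviceSynchronize();                  /* analogous to WaitForGPU() */
    auto t2 = clk::now();                     /* now the kernel has really finished */

    printf("no sync: %.3f ms   with sync: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count(),
           std::chrono::duration<double, std::milli>(t2 - t0).count());
    cudaFree(x);
    return 0;
  }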

--Richard

On 9/23/19 3:28 PM, Zhang, Junchao wrote:
It looks like cusparsestruct->stream is always created (not NULL). I don't understand
the logic of the "if (!cusparsestruct->stream)" check.
--Junchao Zhang


On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev 
<petsc-dev@mcs.anl.gov> wrote:

In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
the end of the function it had

  if (!yy) { /* MatMult */
    if (!cusparsestruct->stream) {
      ierr = WaitForGPU();CHKERRCUDA(ierr);
    }
  }

I assume we don't need the logic to do this only in the MatMult()
with no add case and should just do this all the time, for the
purposes of timing if no other reason. Is there some reason to NOT
do this because of worries about the effects that these
WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu,
now that I look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I
see that we have GPU timing calls around the cusparse_csr_spmv()
(but no WaitForGPU() inside the timed region). I believe this is
another area in which we get a meaningless timing. It looks like
we need a WaitForGPU() there, and then maybe inside the timed
region handling the scatter. (I don't know if this stuff happens
asynchronously or not.) But do we potentially want two
WaitForGPU() calls in one function, just to help with getting
timings? I don't have a good idea of how much overhead this adds.

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence.
The old code swapped the first two lines. Since with
-log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the
order to have better overlap.
  ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know
execution time without -log_view (hence cuda synchronization). I
manually calculated the Total Mflop/s for these cases for easy
comparison.
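
For 3), the driver presumably looks roughly like the sketch below (this is not Junchao's actual test; the -f option and the MatLoad code are my assumptions): load the matrix, do one warm-up MatMult, then time a fixed number of products with PetscTime(). Depending on how much of MatMult remains asynchronous, one may also want to force completion (e.g. with a VecNorm on y) before stopping the timer.

  #include <petscmat.h>
  #include <petsctime.h>

  int main(int argc, char **argv)
  {
    Mat            A;
    Vec            x, y;
    PetscViewer    viewer;
    PetscLogDouble t0, t1;
    PetscInt       i, niter = 100;
    char           file[PETSC_MAX_PATH_LEN];
    PetscBool      flg;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
    ierr = PetscOptionsGetString(NULL, NULL, "-f", file, sizeof(file), &flg);CHKERRQ(ierr);
    ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, file, FILE_MODE_READ, &viewer);CHKERRQ(ierr);
    ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
    ierr = MatSetFromOptions(A);CHKERRQ(ierr);               /* picks up -mat_type from the command line */
    ierr = MatLoad(A, viewer);CHKERRQ(ierr);
    ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
    ierr = MatCreateVecs(A, &x, &y);CHKERRQ(ierr);
    ierr = VecSet(x, 1.0);CHKERRQ(ierr);

    ierr = MatMult(A, x, y);CHKERRQ(ierr);                   /* warm-up: first call pays setup/copy costs */
    ierr = PetscTime(&t0);CHKERRQ(ierr);
    for (i = 0; i < niter; i++) {
      ierr = MatMult(A, x, y);CHKERRQ(ierr);
    }
    ierr = PetscTime(&t1);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "%d MatMult: %g s\n", (int)niter, (double)(t1 - t0));CHKERRQ(ierr);

    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = VecDestroy(&y);CHKERRQ(ierr);
    ierr = MatDestroy(&A);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }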

<>



Event                Count      Time (sec)     Flop                             --- Global ---  --- Stage ---   Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---
6 MPI ranks,
MatMult          100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd    100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13

Re: [petsc-dev] MatMult on Summit

2019-09-24 Thread Mark Adams via petsc-dev
Yes, please, thank you.

On Tue, Sep 24, 2019 at 1:46 AM Mills, Richard Tran via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> Karl, that would be fantastic. Much obliged!
>
> --Richard
>
> On 9/23/19 8:09 PM, Karl Rupp wrote:
>
> Hi,
>
> `git grep cudaStreamCreate` reports that vectors, matrices and scatters
> create their own streams. This will almost inevitably create races (there
> is no synchronization mechanism implemented), unless one calls WaitForGPU()
> after each operation. Some of the non-deterministic tests can likely be
> explained by this.
>
> I'll clean this up in the next few hours if there are no objections.
>
> Best regards,
> Karli
>
>
>
> On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
>
> I'm no CUDA expert (not yet, anyway), but, from what I've read, the
> default stream (stream 0) is (mostly) synchronous to host and device, so
> WaitForGPU() is not needed in that case. I don't know if there is any
> performance penalty in explicitly calling it in that case, anyway.
>
> In any case, it looks like there are still some cases where potentially
> asynchronous CUDA library calls are being "timed" without a WaitForGPU() to
> ensure that the calls actually complete. I will make a pass through the
> aijcusparse and aijviennacl code looking for these.
>
> --Richard
>
> On 9/23/19 3:28 PM, Zhang, Junchao wrote:
>
> It looks like cusparsestruct->stream is always created (not NULL). I don't
> understand the logic of the "if (!cusparsestruct->stream)" check.
> --Junchao Zhang
>
>
> On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev <
> <petsc-dev@mcs.anl.gov> wrote:
>
> In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
> the end of the function it had
>
>   if (!yy) { /* MatMult */
>     if (!cusparsestruct->stream) {
>       ierr = WaitForGPU();CHKERRCUDA(ierr);
>     }
>   }
>
> I assume we don't need the logic to do this only in the MatMult()
> with no add case and should just do this all the time, for the
> purposes of timing if no other reason. Is there some reason to NOT
> do this because of worries about the effects that these
> WaitForGPU() invocations might have on performance?
>
> I notice other problems in aijcusparse.cu,
> now that I look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I
> see that we have GPU timing calls around the cusparse_csr_spmv()
> (but no WaitForGPU() inside the timed region). I believe this is
> another area in which we get a meaningless timing. It looks like
> we need a WaitForGPU() there, and then maybe inside the timed
> region handling the scatter. (I don't know if this stuff happens
> asynchronously or not.) But do we potentially want two
> WaitForGPU() calls in one function, just to help with getting
> timings? I don't have a good idea of how much overhead this adds.
>
> --Richard
>
> On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
>
> I made the following changes:
> 1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>   ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>   ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
>   PetscFunctionReturn(0);
> 2) In MatMult_MPIAIJCUSPARSE, use the following code sequence.
> The old code swapped the first two lines. Since with
> -log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the
> order to have better overlap.
>   ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>   ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>   ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>   ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
> 3) Log time directly in the test code so we can also know
> execution time without -log_view (hence cuda synchronization). I
> manually calculated the Total Mflop/s for these cases for easy
> comparison.
>
> <>
>
>
> 
> Event                Count      Time (sec)     Flop                             --- Global ---  --- Stage ---   Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>                    Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> ---
> 6 MPI ranks,
> MatMult          100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00    0 0.00e+00  0
> VecScatterBegin  100 1.0 4.914

Re: [petsc-dev] MatMult on Summit

2019-09-24 Thread Karl Rupp via petsc-dev

Hi Mark, Richard, Junchao, et al.,

here we go:
https://gitlab.com/petsc/petsc/merge_requests/2091

This indeed fixes all the inconsistencies in test results for SNES ex19 
and even ex56. A priori I wasn't sure about the latter, but it looks 
like this was the only missing piece.


Mark, this should allow you to move forward with GPUs.

Best regards,
Karli



On 9/24/19 11:05 AM, Mark Adams wrote:

Yes, please, thank you.

On Tue, Sep 24, 2019 at 1:46 AM Mills, Richard Tran via petsc-dev 
<petsc-dev@mcs.anl.gov> wrote:


Karl, that would be fantastic. Much obliged!

--Richard

On 9/23/19 8:09 PM, Karl Rupp wrote:

Hi,

`git grep cudaStreamCreate` reports that vectors, matrices and
scatters create their own streams. This will almost inevitably
create races (there is no synchronization mechanism implemented),
unless one calls WaitForGPU() after each operation. Some of the
non-deterministic tests can likely be explained by this.

I'll clean this up in the next few hours if there are no objections.

Best regards,
Karli



On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:

I'm no CUDA expert (not yet, anyway), but, from what I've read,
the default stream (stream 0) is (mostly) synchronous to host and
device, so WaitForGPU() is not needed in that case. I don't know
if there is any performance penalty in explicitly calling it in
that case, anyway.

In any case, it looks like there are still some cases where
potentially asynchronous CUDA library calls are being "timed"
without a WaitForGPU() to ensure that the calls actually
complete. I will make a pass through the aijcusparse and
aijviennacl code looking for these.

--Richard

On 9/23/19 3:28 PM, Zhang, Junchao wrote:

It looks like cusparsestruct->stream is always created (not NULL). I
don't understand the logic of the "if (!cusparsestruct->stream)" check.
--Junchao Zhang


On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via
petsc-dev <petsc-dev@mcs.anl.gov> wrote:

    In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
    the end of the function it had

      if (!yy) { /* MatMult */
        if (!cusparsestruct->stream) {
          ierr = WaitForGPU();CHKERRCUDA(ierr);
        }
      }

    I assume we don't need the logic to do this only in the MatMult()
    with no add case and should just do this all the time, for the
    purposes of timing if no other reason. Is there some reason to NOT
    do this because of worries about the effects that these
    WaitForGPU() invocations might have on performance?

    I notice other problems in aijcusparse.cu,
    now that I look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I
    see that we have GPU timing calls around the cusparse_csr_spmv()
    (but no WaitForGPU() inside the timed region). I believe this is
    another area in which we get a meaningless timing. It looks like
    we need a WaitForGPU() there, and then maybe inside the timed
    region handling the scatter. (I don't know if this stuff happens
    asynchronously or not.) But do we potentially want two
    WaitForGPU() calls in one function, just to help with getting
    timings? I don't have a good idea of how much overhead this adds.

    --Richard

    On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:

    I made the following changes:
    1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
      ierr = WaitForGPU();CHKERRCUDA(ierr);
      ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
      ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    2) In MatMult_MPIAIJCUSPARSE, use the following code sequence.
    The old code swapped the first two lines. Since with
    -log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the
    order to have better overlap.
      ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
      ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
      ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
      ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
    3) Log time directly in the test code so we can also know
    execution time without -log_view (hence cuda synchronization). I
    manually calculated the Total Mflop/s for these cases for easy
    comparison.

    <>

   

    Event                Count      Time (s