Sorry, I meant the 24 CPU-only run.


> On Jul 30, 2019, at 9:19 AM, Mark Adams <mfad...@lbl.gov> wrote:
> 
> 
> 
> On Mon, Jul 29, 2019 at 11:27 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> 
>   Thanks. Could you please send the output for 24 processes with the GPU? 
> 
> That is in out_cuda_000024....
> 
> 
>    Note the final column of the table gives you the percentage of flops (not 
> rates, actual operations) done on the GPU. For your biggest run, the MatMult 
> is at 18 percent and the KSP solve at 23 percent. I think this is much too 
> low; we'd like to see well over 90 percent of the flops on the GPU, or 95 or 
> more. Is this because you are forced to put very large matrices only on the 
> CPU? 
> 
> Hmm, that is strange. The BLAS1 stuff shows 100% GPU but the coarse grids are 
> on the CPU; this could be because it is > 99.5% and rounds up to 100. And 
> there is this in the last solve phase:
> 
> MatMult              679 1.0 5.2220e+00 1.2 7.58e+09 1.3 8.0e+07 1.1e+04 0.0e+00  1 39 14  8  0   3 74 79 60  0 16438647   438720307    578 1.99e+02  519 2.55e+02 18
> MatMultAdd           150 1.0 1.1836e+00 4.7 3.41e+08 1.2 1.0e+07 1.8e+03 0.0e+00  0  2  2  0  0   1  3 10  1  0 3409019   191195194    120 2.48e+01   60 2.25e+00 21
> MatMultTranspose     150 1.0 5.7940e-01 2.4 3.37e+08 1.2 1.0e+07 1.8e+03 0.0e+00  0  2  2  0  0   0  3 10  1  0 6867795   2539317196     38 1.02e+02  150 3.22e+00 92
>  
> I have added print statements to MatMult_[CUDA,CPU] and it looks fine. Well 
> over 90% should be on the GPU. I am puzzled. I'll keep digging but the log 
> statements look OK.
> 
> 
>    For the MatMult: if we assume the GPU flop rate is 25 times that of the 
> CPU and 18 percent of the flops are done on the GPU, then the time for the 
> GPU run should be 82.7 percent of the time for the CPU run, but it is .90; 
> so where is the extra time? That seems like too much for just the 
> communication. 
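(The arithmetic implied by those numbers, spelled out: with a fraction f = 0.18 of the flops on a GPU assumed to be 25x faster and the remaining 1 - f = 0.82 at CPU speed,

    t_GPU-run / t_CPU-run ~ (1 - f) + f/25 = 0.82 + 0.0072 ~ 0.827,

i.e. about 83 percent of the CPU-only time, versus the measured ratio of roughly 0.90.)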
> 
> I don't follow this analysis, but there is something funny about the 
> logging ...
>  
> 
>    There is so much information and so much happening in the final stage that 
> it is hard to discern what is killing the performance in the GPU case for the 
> KSP solve. Is there any way you can have a stage at the end with several KSP 
> solves and nothing else? 
> 
> I added this, e.g.:
> 
> --- Event Stage 7: KSP only
> 
> SFBcastOpBegin       263 1.0 8.4140e-03 2.7 0.00e+00 0.0 6.1e+04 2.5e+03 0.0e+00  0  0 15  7  0   1  0 91 98  0     0       0      0 0.00e+00    0 0.00e+00  0
> SFBcastOpEnd         263 1.0 6.6676e-02 6.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   8  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> SFReduceBegin         48 1.0 4.5977e-04 2.1 0.00e+00 0.0 6.4e+03 6.0e+02 0.0e+00  0  0  2  0  0   0  0  9  2  0     0       0      0 0.00e+00    0 0.00e+00  0
> SFReduceEnd           48 1.0 5.4065e-0321.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> MatMult              215 1.0 3.9271e-01 1.0 6.33e+08 1.4 5.5e+04 2.7e+03 0.0e+00  1 24 14  7  0  83 89 81 95  0 33405   177859    430 1.75e+01  358 2.23e+01 17
> MatMultAdd            48 1.0 3.3079e-02 1.3 3.20e+07 1.3 6.4e+03 6.0e+02 0.0e+00  0  1  2  0  0   7  5  9  2  0 20318   106989     48 2.33e+00   48 2.24e-01 20
> MatMultTranspose      48 1.0 1.1967e-02 1.8 3.15e+07 1.3 6.4e+03 6.0e+02 0.0e+00  0  1  2  0  0   2  4  9  2  0 55325   781863      0 0.00e+00   72 3.23e-01 93
> MatSolve              24 0.0 3.6270e-03 0.0 1.02e+07 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2810       0      0 0.00e+00    0 0.00e+00  0
> MatResidual           48 1.0 8.2272e-02 1.0 1.33e+08 1.4 1.2e+04 2.6e+03 0.0e+00  0  5  3  1  0  17 19 18 20  0 33284   136803     96 3.62e+00   72 4.50e+00 19
> VecTDot               46 1.0 6.1646e-03 1.3 1.13e+06 1.2 0.0e+00 0.0e+00 4.6e+01  0  0  0  0  2   1  0  0  0 66  4109    6814      0 0.00e+00    0 0.00e+00 100
> VecNorm               24 1.0 5.2724e-03 1.9 5.90e+05 1.2 0.0e+00 0.0e+00 2.4e+01  0  0  0  0  1   1  0  0  0 34  2507    5050      0 0.00e+00    0 0.00e+00 100
> VecCopy              146 1.0 3.9029e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0       0      0 0.00e+00   24 9.87e-02  0
> VecSet               169 1.0 1.3301e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecAXPY               46 1.0 1.5963e-03 1.2 1.13e+06 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 15870   23070      0 0.00e+00    0 0.00e+00 100
> VecAYPX              310 1.0 1.3059e-02 1.1 4.25e+06 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  1  0  0  0  7273   12000     48 1.97e-01    0 0.00e+00 100
> VecAXPBYCZ            96 1.0 6.8591e-03 1.2 6.19e+06 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  1  0  0  0 20134   46381      0 0.00e+00    0 0.00e+00 100
> VecPointwiseMult     192 1.0 7.1075e-03 1.2 1.24e+06 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0  3886    4184     24 9.87e-02    0 0.00e+00 100
> VecScatterBegin      311 1.0 1.1026e-02 2.0 0.00e+00 0.0 6.8e+04 2.3e+03 0.0e+00  0  0 17  7  0   2  0100100  0     0       0      0 0.00e+00   72 3.50e-01  0
> VecScatterEnd        311 1.0 7.2357e-02 7.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   9  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecCUDACopyTo        550 1.0 1.5607e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0    550 2.01e+01    0 0.00e+00  0
> VecCUDACopyFrom      478 1.0 1.7491e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0      0 0.00e+00  478 2.29e+01  0
> VecCopyFromSome       24 1.0 7.9868e-04 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00   24 1.26e-01  0
> KSPSolve               1 1.0 4.6980e-01 1.0 7.11e+08 1.4 6.8e+04 2.3e+03 7.0e+01  1 28 17  7  3 100100100100100 31476   83700    550 2.01e+01  502 2.30e+01 23
> PCSetUpOnBlocks       24 1.0 4.2097e-05 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> PCApply               24 1.0 3.8880e-01 1.0 6.02e+08 1.4 6.2e+04 2.2e+03 0.0e+00  1 23 16  6  0  83 84 91 86  0 32127   96704    504 1.71e+01  456 1.88e+01 24
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
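A stage like the "KSP only" stage above is created with PETSc's stage-logging routines (PetscLogStageRegister/PetscLogStagePush/PetscLogStagePop) wrapped around the solves. A minimal sketch of the idiom; the wrapper function and its name are illustrative, not the code actually used for these runs:

    #include <petscksp.h>

    /* Run a KSP solve inside its own logging stage so that -log_view
       reports it in a separate "KSP only" section, apart from setup
       and everything else in the run. */
    static PetscErrorCode SolveInOwnStage(KSP ksp, Vec b, Vec x)
    {
      PetscLogStage  stage;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscLogStageRegister("KSP only", &stage);CHKERRQ(ierr);
      ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
      ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr); /* all events timed here land in the new stage */
      ierr = PetscLogStagePop();CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

Running with -log_view then prints the stage as its own "--- Event Stage N: ..." block, like the one above.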
> 
>  
> 
>    Barry
> 
> 
> > On Jul 29, 2019, at 5:26 PM, Mark Adams <mfad...@lbl.gov> wrote:
> > 
> > 
> > 
> > On Mon, Jul 29, 2019 at 5:31 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> > 
> >   I don't understand the notation in the legend on the second page
> > 
> > 12,288 CPUs and no GPUs?
> > 
> > Yes
> >  
> > 
> > 24 GPUs? Or 6 GPUs?
> > 
> > 24 virtual, 6 real GPUs per node. The first case is one node, 24 cores/vGPUs
> >  
> > 
> > 192 GPUs?
> > 
> > 1536 GPUs?
> > 
> > 12,288 GPUs? Or 12288/4 = 3072 GPUs?
> > 
> > Each "GPU" is one core/process/vGPU. So 12,288 virtual GPUs and 3072 
> > physical GPUs.
> > 
> > Maybe I should add 'virtual GPUs' and put (4 processes/SUMMIT GPU)
> >  
> > 
> > So on the largest run, using GPUs or not takes pretty much exactly the same 
> > amount of time?
> > 
> > Yes. The raw Mat-vec is about 3x faster with ~95K equations/process. I've 
> > attached the data.
> >  
> > 
> > What about 6 GPUs vs 24 CPUs? The same amount of time? 
> > 
> > Can you send some log summaries?
> > 
> > [Attachments: out_cpu_012288, out_cuda_000024, out_cuda_001536, out_cuda_000192, out_cuda_012288]
> 
