On Mon, Jul 29, 2019 at 11:27 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>
>    Thanks. Could you please send the 24 processors with the GPU?
>

That is in out_cuda_000024....

>    Note the final column of the table gives you the percentage of flops
> (not rates, actual operations) on the GPU. For your biggest run:
>
>    For the MatMult it is 18 percent and for the KSP solve it is 23 percent.
> I think this is much too low; we'd like to see well over 90 percent of the
> flops on the GPU, or 95 or more. Is this because you are forced to put very
> large matrices only on the CPU?
>

Humm, that is strange. The BLAS1 stuff is 100% on the GPU but the coarse grids are on the CPU. That could be the reason, but it should be > 99.5%. And there is this in the last solve phase:

MatMult           679 1.0 5.2220e+00 1.2 7.58e+09 1.3 8.0e+07 1.1e+04 0.0e+00 1 39 14 8 0 3 74 79 60 0 16438647 438720307 578 1.99e+02 519 2.55e+02 18
MatMultAdd        150 1.0 1.1836e+00 4.7 3.41e+08 1.2 1.0e+07 1.8e+03 0.0e+00 0 2 2 0 0 1 3 10 1 0 3409019 191195194 120 2.48e+01 60 2.25e+00 21
MatMultTranspose  150 1.0 5.7940e-01 2.4 3.37e+08 1.2 1.0e+07 1.8e+03 0.0e+00 0 2 2 0 0 0 3 10 1 0 6867795 2539317196 38 1.02e+02 150 3.22e+00 92

I have added print statements to MatMult_[CUDA,CPU] and it looks fine: well over 90% should be on the GPU. I am puzzled. I'll keep digging, but the log statements look OK.

>    For the MatMult, if we assume the flop rate of the GPU is 25 times that
> of the CPU and 18 percent of the flops are done on the GPU, then the GPU
> time should be 82.7 percent of the CPU time, but it is 0.90; so where is the
> extra time? That seems like too much for just the communication.
>

I don't follow this analysis, but there is something funny about the logging ...
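(For reference, the 82.7 percent figure appears to come from a simple weighted-time estimate. The small sketch below only reproduces that arithmetic under the stated assumptions: an 18 percent GPU flop share, a GPU flop rate 25x the CPU's, and no communication or host/device transfer costs.)

#include <stdio.h>

/* If a fraction f of the flops runs on the GPU at s times the CPU flop rate
   and the remaining (1 - f) stays on the CPU, the expected GPU-case time
   relative to the all-CPU case is (1 - f) + f/s. */
int main(void)
{
  const double f = 0.18;  /* GPU share of the MatMult flops (from the log) */
  const double s = 25.0;  /* assumed GPU:CPU flop-rate ratio               */
  printf("expected time ratio: %.3f\n", (1.0 - f) + f / s); /* prints 0.827 */
  return 0;
}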
>
>    There is so much information and so much happening in the final stage
> that it is hard to discern what is killing the performance in the GPU case
> for the KSP solve. Any way you can just have a stage at the end with several
> KSP solves and nothing else?

I added this, e.g.:

--- Event Stage 7: KSP only

SFBcastOpBegin    263 1.0 8.4140e-03 2.7 0.00e+00 0.0 6.1e+04 2.5e+03 0.0e+00 0 0 15 7 0 1 0 91 98 0 0 0 0 0.00e+00 0 0.00e+00 0
SFBcastOpEnd      263 1.0 6.6676e-02 6.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 8 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
SFReduceBegin      48 1.0 4.5977e-04 2.1 0.00e+00 0.0 6.4e+03 6.0e+02 0.0e+00 0 0 2 0 0 0 0 9 2 0 0 0 0 0.00e+00 0 0.00e+00 0
SFReduceEnd        48 1.0 5.4065e-0321.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
MatMult           215 1.0 3.9271e-01 1.0 6.33e+08 1.4 5.5e+04 2.7e+03 0.0e+00 1 24 14 7 0 83 89 81 95 0 33405 177859 430 1.75e+01 358 2.23e+01 17
MatMultAdd         48 1.0 3.3079e-02 1.3 3.20e+07 1.3 6.4e+03 6.0e+02 0.0e+00 0 1 2 0 0 7 5 9 2 0 20318 106989 48 2.33e+00 48 2.24e-01 20
MatMultTranspose   48 1.0 1.1967e-02 1.8 3.15e+07 1.3 6.4e+03 6.0e+02 0.0e+00 0 1 2 0 0 2 4 9 2 0 55325 781863 0 0.00e+00 72 3.23e-01 93
MatSolve           24 0.0 3.6270e-03 0.0 1.02e+07 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2810 0 0 0.00e+00 0 0.00e+00 0
MatResidual        48 1.0 8.2272e-02 1.0 1.33e+08 1.4 1.2e+04 2.6e+03 0.0e+00 0 5 3 1 0 17 19 18 20 0 33284 136803 96 3.62e+00 72 4.50e+00 19
VecTDot            46 1.0 6.1646e-03 1.3 1.13e+06 1.2 0.0e+00 0.0e+00 4.6e+01 0 0 0 0 2 1 0 0 0 66 4109 6814 0 0.00e+00 0 0.00e+00 100
VecNorm            24 1.0 5.2724e-03 1.9 5.90e+05 1.2 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 1 0 0 0 34 2507 5050 0 0.00e+00 0 0.00e+00 100
VecCopy           146 1.0 3.9029e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 24 9.87e-02 0
VecSet            169 1.0 1.3301e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecAXPY            46 1.0 1.5963e-03 1.2 1.13e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 15870 23070 0 0.00e+00 0 0.00e+00 100
VecAYPX           310 1.0 1.3059e-02 1.1 4.25e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 1 0 0 0 7273 12000 48 1.97e-01 0 0.00e+00 100
VecAXPBYCZ         96 1.0 6.8591e-03 1.2 6.19e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 1 0 0 0 20134 46381 0 0.00e+00 0 0.00e+00 100
VecPointwiseMult  192 1.0 7.1075e-03 1.2 1.24e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 3886 4184 24 9.87e-02 0 0.00e+00 100
VecScatterBegin   311 1.0 1.1026e-02 2.0 0.00e+00 0.0 6.8e+04 2.3e+03 0.0e+00 0 0 17 7 0 2 0100100 0 0 0 0 0.00e+00 72 3.50e-01 0
VecScatterEnd     311 1.0 7.2357e-02 7.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecCUDACopyTo     550 1.0 1.5607e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 550 2.01e+01 0 0.00e+00 0
VecCUDACopyFrom   478 1.0 1.7491e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 0 0.00e+00 478 2.29e+01 0
VecCopyFromSome    24 1.0 7.9868e-04 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 24 1.26e-01 0
KSPSolve            1 1.0 4.6980e-01 1.0 7.11e+08 1.4 6.8e+04 2.3e+03 7.0e+01 1 28 17 7 3 100100100100100 31476 83700 550 2.01e+01 502 2.30e+01 23
PCSetUpOnBlocks    24 1.0 4.2097e-05 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
PCApply            24 1.0 3.8880e-01 1.0 6.02e+08 1.4 6.2e+04 2.2e+03 0.0e+00 1 23 16 6 0 83 84 91 86 0 32127 96704 504 1.71e+01 456 1.88e+01 24
---------------------------------------------------------------------------------------------------------------------------------------------------------------
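(For completeness: a stage like this is typically created with PetscLogStageRegister()/PetscLogStagePush()/PetscLogStagePop() around the solves. The sketch below is only illustrative; the stand-in 1D Laplacian and the solve count are placeholders, not the actual GAMG setup used for these runs.)

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, b;
  KSP            ksp;
  PetscLogStage  stage;
  PetscInt       i, rstart, rend, n = 100;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Assemble a small 1D Laplacian as a placeholder system */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
  for (i = rstart; i < rend; ++i) {
    if (i > 0)     { ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
    if (i < n - 1) { ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatCreateVecs(A, &x, &b);CHKERRQ(ierr);
  ierr = VecSet(b, 1.0);CHKERRQ(ierr);

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSetUp(ksp);CHKERRQ(ierr);   /* keep setup out of the timed stage */

  /* Everything between Push and Pop is reported as its own "KSP only" stage in -log_view */
  ierr = PetscLogStageRegister("KSP only", &stage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  for (i = 0; i < 10; ++i) { ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr); }
  ierr = PetscLogStagePop();CHKERRQ(ierr);

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Running with -log_view produces the per-stage table; for the GPU runs the matrix and vector types would be switched with the usual -mat_type aijcusparse -vec_type cuda options.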
>
>    Barry
>
> On Jul 29, 2019, at 5:26 PM, Mark Adams <mfad...@lbl.gov> wrote:
>
> > On Mon, Jul 29, 2019 at 5:31 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> >
> >    I don't understand the notation in the legend on the second page.
> >
> >    12,288 CPUs and no GPUs?
> >
> > Yes
> >
> >    24 GPUs? or 6 GPUs?
> >
> > 24 virtual, 6 real GPUs per node. The first case is one node, 24 cores/vGPUs.
> >
> >    192 GPUs?
> >
> >    1536 GPUs?
> >
> >    12,288 GPUs? or 12288/4 = 3072 GPUs?
> >
> > All "GPUs" are one core/process/vGPU. So 12288 virtual GPUs and 3072 physical GPUs.
> >
> > Maybe I should add 'virtual GPUs' and put (4 processes/SUMMIT GPU).
> >
> >    So on the largest run, using GPUs or not takes pretty much exactly the same amount of time?
> >
> > Yes. The raw Mat-vec is about 3x faster with ~95K equations/process. I've attached the data.
> >
> >    What about 6 GPUs vs 24 CPUs? The same amount of time.
> >
> >    Can you send some log summaries?
> >
> > <out_cpu_012288><out_cuda_000024><out_cuda_001536><out_cuda_000192><out_cuda_012288>