Sorry, I meant 24 CPU only
> On Jul 30, 2019, at 9:19 AM, Mark Adams <mfad...@lbl.gov> wrote:
>
> On Mon, Jul 29, 2019 at 11:27 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>
> >    Thanks. Could you please send the 24 processors with the GPU?
>
> That is in out_cuda_000024....
>
> >    Note the final column of the table gives you the percentage of flops (not rates, actual operations) on the GPU. For your biggest run: for the MatMult it is 18 percent, and for the KSP solve it is 23 percent. I think this is much too low; we'd like to see well over 90 percent of the flops on the GPU, or 95 or more. Is this because you are forced to put very large matrices only on the CPU?
>
> Humm, that is strange. BLAS1 stuff is 100% GPU, but the coarse grids are on the CPU. This could be because it is > 99.5%. And there is this in the last solve phase:
>
> MatMult            679 1.0 5.2220e+00 1.2 7.58e+09 1.3 8.0e+07 1.1e+04 0.0e+00  1 39 14  8  0   3 74 79 60  0 16438647  438720307  578 1.99e+02  519 2.55e+02  18
> MatMultAdd         150 1.0 1.1836e+00 4.7 3.41e+08 1.2 1.0e+07 1.8e+03 0.0e+00  0  2  2  0  0   1  3 10  1  0  3409019  191195194  120 2.48e+01   60 2.25e+00  21
> MatMultTranspose   150 1.0 5.7940e-01 2.4 3.37e+08 1.2 1.0e+07 1.8e+03 0.0e+00  0  2  2  0  0   0  3 10  1  0  6867795 2539317196   38 1.02e+02  150 3.22e+00  92
>
> I have added print statements to MatMult_[CUDA,CPU] and it looks fine. Well over 90% should be on the GPU. I am puzzled. I'll keep digging, but the log statements look OK.
>
> >    For the MatMult, if we assume the flop rate of the GPU is 25 times that of the CPU and 18 percent of the flops are done on the GPU, then the time for the GPU run should be 82.7 percent of the time for the CPU run, but it is .90; so where is the extra time? That seems like too much to attribute to communication alone.
>
> I don't follow this analysis, but there is something funny about the logging ...
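[Editor's note: the 82.7 percent figure in the analysis above follows from a simple two-speed model. This is my own sketch, not code from the thread or from PETSc: if a fraction f of the flops run on a GPU that is s times faster than the CPU, the mixed run should take (1 - f) + f/s of the CPU-only time.]

```python
def mixed_time_fraction(f_gpu: float, speedup: float) -> float:
    """Expected fraction of the CPU-only runtime when f_gpu of the flops
    run on a GPU that is `speedup` times faster than the CPU."""
    return (1.0 - f_gpu) + f_gpu / speedup

# Barry's numbers: 18% of the flops on a GPU assumed to be 25x faster.
print(round(mixed_time_fraction(0.18, 25.0), 3))  # → 0.827
```

With the observed ratio at 0.90 rather than 0.827, the roughly 7 percent of unexplained time is what the question "where is the extra time?" refers to.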
> >    There is so much information and so much happening in the final stage that it is hard to discern what is killing the performance in the GPU case for the KSP solve. Is there any way you can just have a stage at the end with several KSP solves and nothing else?
>
> I added this, e.g.,
>
> --- Event Stage 7: KSP only
>
> SFBcastOpBegin     263 1.0 8.4140e-03  2.7 0.00e+00 0.0 6.1e+04 2.5e+03 0.0e+00  0  0 15  7  0   1  0 91 98  0      0       0     0 0.00e+00    0 0.00e+00   0
> SFBcastOpEnd       263 1.0 6.6676e-02  6.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   8  0  0  0  0      0       0     0 0.00e+00    0 0.00e+00   0
> SFReduceBegin       48 1.0 4.5977e-04  2.1 0.00e+00 0.0 6.4e+03 6.0e+02 0.0e+00  0  0  2  0  0   0  0  9  2  0      0       0     0 0.00e+00    0 0.00e+00   0
> SFReduceEnd         48 1.0 5.4065e-03 21.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0     0 0.00e+00    0 0.00e+00   0
> MatMult            215 1.0 3.9271e-01  1.0 6.33e+08 1.4 5.5e+04 2.7e+03 0.0e+00  1 24 14  7  0  83 89 81 95  0  33405  177859   430 1.75e+01  358 2.23e+01  17
> MatMultAdd          48 1.0 3.3079e-02  1.3 3.20e+07 1.3 6.4e+03 6.0e+02 0.0e+00  0  1  2  0  0   7  5  9  2  0  20318  106989    48 2.33e+00   48 2.24e-01  20
> MatMultTranspose    48 1.0 1.1967e-02  1.8 3.15e+07 1.3 6.4e+03 6.0e+02 0.0e+00  0  1  2  0  0   2  4  9  2  0  55325  781863     0 0.00e+00   72 3.23e-01  93
> MatSolve            24 0.0 3.6270e-03  0.0 1.02e+07 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   2810       0     0 0.00e+00    0 0.00e+00   0
> MatResidual         48 1.0 8.2272e-02  1.0 1.33e+08 1.4 1.2e+04 2.6e+03 0.0e+00  0  5  3  1  0  17 19 18 20  0  33284  136803    96 3.62e+00   72 4.50e+00  19
> VecTDot             46 1.0 6.1646e-03  1.3 1.13e+06 1.2 0.0e+00 0.0e+00 4.6e+01  0  0  0  0  2   1  0  0  0 66   4109    6814     0 0.00e+00    0 0.00e+00 100
> VecNorm             24 1.0 5.2724e-03  1.9 5.90e+05 1.2 0.0e+00 0.0e+00 2.4e+01  0  0  0  0  1   1  0  0  0 34   2507    5050     0 0.00e+00    0 0.00e+00 100
> VecCopy            146 1.0 3.9029e-03  1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0      0       0     0 0.00e+00   24 9.87e-02   0
> VecSet             169 1.0 1.3301e-03  1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0     0 0.00e+00    0 0.00e+00   0
> VecAXPY             46 1.0 1.5963e-03  1.2 1.13e+06 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  15870   23070     0 0.00e+00    0 0.00e+00 100
> VecAYPX            310 1.0 1.3059e-02  1.1 4.25e+06 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  1  0  0  0   7273   12000    48 1.97e-01    0 0.00e+00 100
> VecAXPBYCZ          96 1.0 6.8591e-03  1.2 6.19e+06 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  1  0  0  0  20134   46381     0 0.00e+00    0 0.00e+00 100
> VecPointwiseMult   192 1.0 7.1075e-03  1.2 1.24e+06 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0   3886    4184    24 9.87e-02    0 0.00e+00 100
> VecScatterBegin    311 1.0 1.1026e-02  2.0 0.00e+00 0.0 6.8e+04 2.3e+03 0.0e+00  0  0 17  7  0   2  0 100 100  0     0       0     0 0.00e+00   72 3.50e-01   0
> VecScatterEnd      311 1.0 7.2357e-02  7.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   9  0  0  0  0      0       0     0 0.00e+00    0 0.00e+00   0
> VecCUDACopyTo      550 1.0 1.5607e-02  1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0      0       0   550 2.01e+01    0 0.00e+00   0
> VecCUDACopyFrom    478 1.0 1.7491e-02  1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0      0       0     0 0.00e+00  478 2.29e+01   0
> VecCopyFromSome     24 1.0 7.9868e-04  1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0     0 0.00e+00   24 1.26e-01   0
> KSPSolve             1 1.0 4.6980e-01  1.0 7.11e+08 1.4 6.8e+04 2.3e+03 7.0e+01  1 28 17  7  3 100 100 100 100 100  31476   83700   550 2.01e+01  502 2.30e+01  23
> PCSetUpOnBlocks     24 1.0 4.2097e-05  3.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0     0 0.00e+00    0 0.00e+00   0
> PCApply             24 1.0 3.8880e-01  1.0 6.02e+08 1.4 6.2e+04 2.2e+03 0.0e+00  1 23 16  6  0  83 84 91 86  0  32127   96704   504 1.71e+01  456 1.88e+01  24
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> >
> >    Barry
> >
> > On Jul 29, 2019, at 5:26 PM, Mark Adams <mfad...@lbl.gov> wrote:
> >
> > > On Mon, Jul 29, 2019 at 5:31 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> > >
> > > >    I don't understand the notation in the legend on the second page.
> > > >
> > > >    12,288 CPUs and no GPUs?
> > >
> > > Yes
> > >
> > > >    24 GPUs? or 6 GPUs?
> > >
> > > 24 virtual, 6 real GPUs per node. The first case is one node, 24 cores/vGPUs.
> > >
> > > >    192 GPUs?
> > > >
> > > >    1536 GPUs?
> > > >
> > > >    12,288 GPUs? or 12288/4 = 3072 GPUs?
> > >
> > > All "GPUs" are one core/process/vGPU. So 12,288 virtual GPUs and 3072 physical GPUs.
> > >
> > > Maybe I should add "virtual GPUs" and put (4 processes/SUMMIT GPU).
> > >
> > > >    So on the largest run, using GPUs or not takes pretty much exactly the same amount of time?
> > >
> > > Yes. The raw Mat-vec is about 3x faster with ~95K equations/process. I've attached the data.
> >
> >    What about 6 GPUs vs 24 CPUs? The same amount of time?
> >
> >    Can you send some log summaries?
>
> <out_cpu_012288> <out_cuda_000024> <out_cuda_001536> <out_cuda_000192> <out_cuda_012288>
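[Editor's note: as the thread states, the per-event rows in these -log_view tables end with the percentage of that event's flops performed on the GPU. A tiny helper, my own illustration rather than anything in PETSc, makes the column being discussed explicit:]

```python
def gpu_flop_percent(event_row: str) -> int:
    """Return the final column of a whitespace-separated -log_view event
    row with GPU logging enabled: the percent of flops done on the GPU."""
    return int(event_row.split()[-1])

# The MatMult row from the "KSP only" stage above (spacing condensed).
row = ("MatMult 215 1.0 3.9271e-01 1.0 6.33e+08 1.4 5.5e+04 2.7e+03 "
       "0.0e+00 1 24 14 7 0 83 89 81 95 0 33405 177859 430 1.75e+01 "
       "358 2.23e+01 17")
print(gpu_flop_percent(row))  # → 17
```

Applied to the rows above, this is the 17–24 percent for the Mat/KSP events versus 100 percent for the BLAS1 vector operations, which is the discrepancy being debugged.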