On Mon, Jul 29, 2019 at 11:27 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:

>
>   Thanks. Could you please send the 24-processor run with the GPU?
>

That is in out_cuda_000024....


>    Note the final column of the table gives you the percentage of flops
> (not rates, actual operations) on the GPU. For your biggest run:
>
>    For the MatMult it is 18 percent and for the KSP solve it is 23 percent.
> I think this is much too low; we'd like to see well over 90 percent of the
> flops on the GPU, or 95 or more. Is this because you are forced to put very
> large matrices only on the CPU?
>

Hmm, that is strange. The BLAS1 stuff shows 100% on the GPU even though the
coarse grids are on the CPU, so that is probably really > 99.5% rounded up.
And there is this in the last solve phase:

MatMult              679 1.0 5.2220e+00 1.2 7.58e+09 1.3 8.0e+07 1.1e+04 0.0e+00  1 39 14  8  0   3 74 79 60  0 16438647   438720307    578 1.99e+02  519 2.55e+02 18
MatMultAdd           150 1.0 1.1836e+00 4.7 3.41e+08 1.2 1.0e+07 1.8e+03 0.0e+00  0  2  2  0  0   1  3 10  1  0 3409019   191195194    120 2.48e+01   60 2.25e+00 21
MatMultTranspose     150 1.0 5.7940e-01 2.4 3.37e+08 1.2 1.0e+07 1.8e+03 0.0e+00  0  2  2  0  0   0  3 10  1  0 6867795   2539317196     38 1.02e+02  150 3.22e+00 92

I have added print statements to MatMult_[CUDA,CPU] and it looks fine: well
over 90% of the MatMult work should be on the GPU. I am puzzled. I'll keep
digging, but the log statements look OK.
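
For reference, the tracing is nothing fancy; it is roughly the line below,
dropped into each back end (a sketch, not the actual diff; I am assuming the
prints went near the top of MatMult_SeqAIJCUSPARSE and MatMult_SeqAIJ, and
the message text is made up here):

  /* hypothetical debug print in the GPU MatMult back end; both back ends
     take the Mat as their first argument, so the same line (saying "CPU
     path") goes into the CPU version */
  ierr = PetscPrintf(PETSC_COMM_SELF, "MatMult (GPU path) on %D local rows\n",
                     A->rmap->n);CHKERRQ(ierr);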


>    For the MatMult, if we assume the flop rate for the GPU is 25 times that
> of the CPU and 18 percent of the flops are done on the GPU, then the time
> for the GPU run should be 82.7 percent of the time for the CPU run, but it
> is 0.90; so where is the extra time? It seems like more than can be
> explained by the communication alone.
>

I don't follow this analysis, but there is something funny about the
logging ...
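
(For what it's worth, my reading of the estimate, assuming the 18% of flops
on the GPU run 25x faster and the remaining 82% run at CPU speed, is the toy
model below; on that model anything above ~0.83 measured would have to come
from communication, copies, or launch overhead. This is my reconstruction,
not necessarily the exact reasoning.)

  /* toy model: expected (GPU run time) / (CPU run time) when only a
     fraction of the flops executes on the GPU at an assumed fixed speedup */
  #include <stdio.h>

  int main(void)
  {
    double gpu_frac = 0.18; /* fraction of flops on the GPU, from -log_view */
    double speedup  = 25.0; /* assumed GPU vs CPU flop-rate ratio */
    double ratio    = (1.0 - gpu_frac) + gpu_frac / speedup;
    printf("t_GPU / t_CPU = %.4f\n", ratio); /* prints 0.8272 */
    return 0;
  }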


>
>    There is so much information and so much happening in the final stage
> that it is hard to discern what is killing the performance in the GPU case
> for the KSP solve. Is there any way you can just have a stage at the end
> with several KSP solves and nothing else?
>

I added this.

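The stage is set up with the usual PetscLogStage push/pop pattern, roughly as
in the sketch below (a minimal sketch; the variable names and what exactly
sits inside the stage are illustrative, the real code keeps setup and warm-up
solves outside it):

  PetscLogStage ksp_stage;
  ierr = PetscLogStageRegister("KSP only", &ksp_stage);CHKERRQ(ierr);
  /* ... setup, warm-up solves, etc. happen outside the stage ... */
  ierr = PetscLogStagePush(ksp_stage);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);  /* only the timed solve(s) */
  ierr = PetscLogStagePop();CHKERRQ(ierr);

and the new stage in -log_view comes out as:
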
--- Event Stage 7: KSP only

SFBcastOpBegin       263 1.0 8.4140e-03 2.7 0.00e+00 0.0 6.1e+04 2.5e+03 0.0e+00  0  0 15  7  0   1  0 91 98  0     0       0      0 0.00e+00    0 0.00e+00  0
SFBcastOpEnd         263 1.0 6.6676e-02 6.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   8  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
SFReduceBegin         48 1.0 4.5977e-04 2.1 0.00e+00 0.0 6.4e+03 6.0e+02 0.0e+00  0  0  2  0  0   0  0  9  2  0     0       0      0 0.00e+00    0 0.00e+00  0
SFReduceEnd           48 1.0 5.4065e-0321.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatMult              215 1.0 3.9271e-01 1.0 6.33e+08 1.4 5.5e+04 2.7e+03 0.0e+00  1 24 14  7  0  83 89 81 95  0 33405   177859    430 1.75e+01  358 2.23e+01 17
MatMultAdd            48 1.0 3.3079e-02 1.3 3.20e+07 1.3 6.4e+03 6.0e+02 0.0e+00  0  1  2  0  0   7  5  9  2  0 20318   106989     48 2.33e+00   48 2.24e-01 20
MatMultTranspose      48 1.0 1.1967e-02 1.8 3.15e+07 1.3 6.4e+03 6.0e+02 0.0e+00  0  1  2  0  0   2  4  9  2  0 55325   781863      0 0.00e+00   72 3.23e-01 93
MatSolve              24 0.0 3.6270e-03 0.0 1.02e+07 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2810       0      0 0.00e+00    0 0.00e+00  0
MatResidual           48 1.0 8.2272e-02 1.0 1.33e+08 1.4 1.2e+04 2.6e+03 0.0e+00  0  5  3  1  0  17 19 18 20  0 33284   136803     96 3.62e+00   72 4.50e+00 19
VecTDot               46 1.0 6.1646e-03 1.3 1.13e+06 1.2 0.0e+00 0.0e+00 4.6e+01  0  0  0  0  2   1  0  0  0 66  4109    6814      0 0.00e+00    0 0.00e+00 100
VecNorm               24 1.0 5.2724e-03 1.9 5.90e+05 1.2 0.0e+00 0.0e+00 2.4e+01  0  0  0  0  1   1  0  0  0 34  2507    5050      0 0.00e+00    0 0.00e+00 100
VecCopy              146 1.0 3.9029e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0       0      0 0.00e+00   24 9.87e-02  0
VecSet               169 1.0 1.3301e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecAXPY               46 1.0 1.5963e-03 1.2 1.13e+06 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 15870   23070      0 0.00e+00    0 0.00e+00 100
VecAYPX              310 1.0 1.3059e-02 1.1 4.25e+06 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  1  0  0  0  7273   12000     48 1.97e-01    0 0.00e+00 100
VecAXPBYCZ            96 1.0 6.8591e-03 1.2 6.19e+06 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  1  0  0  0 20134   46381      0 0.00e+00    0 0.00e+00 100
VecPointwiseMult     192 1.0 7.1075e-03 1.2 1.24e+06 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0  3886    4184     24 9.87e-02    0 0.00e+00 100
VecScatterBegin      311 1.0 1.1026e-02 2.0 0.00e+00 0.0 6.8e+04 2.3e+03 0.0e+00  0  0 17  7  0   2  0100100  0     0       0      0 0.00e+00   72 3.50e-01  0
VecScatterEnd        311 1.0 7.2357e-02 7.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   9  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecCUDACopyTo        550 1.0 1.5607e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0    550 2.01e+01    0 0.00e+00  0
VecCUDACopyFrom      478 1.0 1.7491e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0      0 0.00e+00  478 2.29e+01  0
VecCopyFromSome       24 1.0 7.9868e-04 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00   24 1.26e-01  0
KSPSolve               1 1.0 4.6980e-01 1.0 7.11e+08 1.4 6.8e+04 2.3e+03 7.0e+01  1 28 17  7  3 100100100100100 31476   83700    550 2.01e+01  502 2.30e+01 23
PCSetUpOnBlocks       24 1.0 4.2097e-05 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
PCApply               24 1.0 3.8880e-01 1.0 6.02e+08 1.4 6.2e+04 2.2e+03 0.0e+00  1 23 16  6  0  83 84 91 86  0 32127   96704    504 1.71e+01  456 1.88e+01 24
---------------------------------------------------------------------------------------------------------------------------------------------------------------



>
>    Barry
>
>
> > On Jul 29, 2019, at 5:26 PM, Mark Adams <mfad...@lbl.gov> wrote:
> >
> >
> >
> > On Mon, Jul 29, 2019 at 5:31 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> >
> >   I don't understand the notation in the legend on the second page
> >
> > 12,288 CPUs and no GPUs?
> >
> > Yes
> >
> >
> > 24 GPUs? or 6 GPUs?
> >
> > 24 virtual, 6 real GPUs per node. The first case is one node, 24
> cores/vGPUs
> >
> >
> > 192 GPUs?
> >
> > 1536 GPUs?
> >
> > 12,288 GPUs? or 12288/4 = 3072 GPUs?
> >
> > All "GPUs" are one core/process/vGPU. So 12288 virtual GPUs and 3072
> physical GPUs.
> >
> > Maybe I should add 'virtual GPUs' and put (4 processes/SUMMIT GPU)
> >
> >
> > So on the largest run, using GPUs or not takes pretty much exactly the
> > same amount of time?
> >
> > Yes. The raw Mat-vec is about 3x faster with ~95K equations/process. I've
> > attached the data.
> >
> >
> > What about 6 GPUs vs 24 CPUs? The same amount of time?
> >
> > Can you send some log summaries?
> >
> >
> <out_cpu_012288><out_cuda_000024><out_cuda_001536><out_cuda_000192><out_cuda_012288>
>
>
