Junchao,

     Without GPU-aware MPI, is it moving the entire vector to the CPU, doing 
the scatter there, and moving everything back? Or does it move up only what 
needs to be sent to the other ranks and move back only what it received from 
the other ranks?

    It is moving 4.74e+02 * 1e+6 bytes of data in total, up and then down. Is 
that a reasonable amount?
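
    A back-of-the-envelope sketch of what the two candidate strategies would 
imply; the per-rank local size and ghost count below are assumptions, not 
numbers taken from the run, so plug in the real values from -log_view:

    #include <stdio.h>

    /* Hedged sketch: estimate aggregate host<->device traffic for the KSP solve
       under the two possible scatter strategies. All inputs are placeholders. */
    int main(void)
    {
      const double n_local   = 2.0e6 / 8.0; /* assumed dofs per rank: 2M eq on 8 ranks */
      const double n_ghost   = 3.0e4;       /* assumed ghost entries per rank (placeholder) */
      const double n_matmult = 400.0;       /* MatMults in the KSP Solve stage */
      const double bytes     = 8.0;         /* size of a real PetscScalar */
      const int    nranks    = 8;

      /* Strategy A: copy the whole local vector up and back down every MatMult */
      const double whole  = 2.0 * n_matmult * nranks * n_local * bytes;
      /* Strategy B: copy up only the packed send buffer, down only the received ghosts */
      const double packed = 2.0 * n_matmult * nranks * n_ghost * bytes;

      printf("whole-vector strategy: %.3e bytes\n", whole);
      printf("packed-only strategy : %.3e bytes\n", packed);
      return 0;
    }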

    Why is it moving 800 distinct counts up and 800 distinct counts down when 
the MatMult is done 400 times? Shouldn't it be 400 counts?

  Mark,

     Can you run both with GPU-aware MPI?

   
  Norm, AXPY, and PointwiseMult are roughly the same.


> On Jan 23, 2022, at 11:24 PM, Mark Adams <mfad...@lbl.gov> wrote:
> 
> Ugh, try again. Still a big difference, but less.  Mat-vec does not change 
> much.
> 
> On Sun, Jan 23, 2022 at 7:12 PM Barry Smith <bsm...@petsc.dev> wrote:
> 
>  You have debugging turned on for Crusher but not for Perlmutter.
> 
>> On Jan 23, 2022, at 6:37 PM, Mark Adams <mfad...@lbl.gov> wrote:
>> 
>> * Perlmutter is roughly 5x faster than Crusher on the one-node 2M eq test 
>> (small). This is with 8 processes.
>> 
>> * The next largest version of this test, 16M eq total and 8 processes, fails 
>> in memory allocation in the mat-mult setup in the Kokkos Mat.
>> 
>> * If I try to run with 64 processes on Perlmutter I get this error in 
>> initialization. These nodes have 160 GB of memory.
>> (I assume this is related to the large memory requirements of loading 
>> packages, etc....)
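
   Not claiming that is the cause, but one hedged way to check whether 64 CUDA 
contexts per node are what exhaust device memory is to have every rank create 
its context and print what cudaMemGetInfo reports; the round-robin rank-to-GPU 
mapping below is an assumption about how the job maps ranks to devices.

   #include <mpi.h>
   #include <cuda_runtime.h>
   #include <stdio.h>

   /* Diagnostic sketch: each rank creates a CUDA context and reports the
      remaining device memory, exposing per-context overhead when many ranks
      share one GPU. Round-robin rank->device mapping is assumed. */
   int main(int argc, char **argv)
   {
     int    rank, ndev;
     size_t freeb, totalb;

     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     cudaGetDeviceCount(&ndev);
     cudaSetDevice(rank % ndev);  /* assumed mapping of ranks to GPUs */
     cudaFree(0);                 /* force context creation */
     cudaMemGetInfo(&freeb, &totalb);
     printf("rank %d on device %d: %.2f GiB free of %.2f GiB\n",
            rank, rank % ndev, freeb / 1073741824.0, totalb / 1073741824.0);
     MPI_Finalize();
     return 0;
   }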
>> 
>> Thanks,
>> Mark
>> 
>> + srun -n64 -N1 --cpu-bind=cores --ntasks-per-core=1 ../ex13 -dm_plex_box_faces 4,4,4 -petscpartitioner_simple_process_grid 4,4,4 -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1 -dm_refine 6 -dm_view -pc_type jacobi -log_view -ksp_view -use_gpu_aware_mpi false -dm_mat_type aijkokkos -dm_vec_type kokkos -log_trace
>> + tee jac_out_001_kokkos_Perlmutter_6_8.txt
>> [48]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>> [48]PETSC ERROR: GPU error
>> [48]PETSC ERROR: cuda error 2 (cudaErrorMemoryAllocation) : out of memory
>> [48]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
>> [48]PETSC ERROR: Petsc Development GIT revision: v3.16.3-683-gbc458ed4d8  GIT Date: 2022-01-22 12:18:02 -0600
>> [48]PETSC ERROR: /global/u2/m/madams/petsc/src/snes/tests/data/../ex13 on a arch-perlmutter-opt-gcc-kokkos-cuda named nid001424 by madams Sun Jan 23 15:19:56 2022
>> [48]PETSC ERROR: Configure options --CFLAGS="   -g -DLANDAU_DIM=2 -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CXXFLAGS=" -g -DLANDAU_DIM=2 -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CUDAFLAGS="-g -Xcompiler -rdynamic -DLANDAU_DIM=2 -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --with-cc=cc --with-cxx=CC --with-fc=ftn --LDFLAGS=-lmpifort_gnu_91 --with-cudac=/global/common/software/nersc/cos1.3/cuda/11.3.0/bin/nvcc --COPTFLAGS="   -O3" --CXXOPTFLAGS=" -O3" --FOPTFLAGS="   -O3" --with-debugging=0 --download-metis --download-parmetis --with-cuda=1 --with-cuda-arch=80 --with-mpiexec=srun --with-batch=0 --download-p4est=1 --with-zlib=1 --download-kokkos --download-kokkos-kernels --with-kokkos-kernels-tpl=0 --with-make-np=8 PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda
>> [48]PETSC ERROR: #1 initialize() at /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:72
>> [48]PETSC ERROR: #2 initialize() at /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:343
>> [48]PETSC ERROR: #3 PetscDeviceInitializeTypeFromOptions_Private() at /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:319
>> [48]PETSC ERROR: #4 PetscDeviceInitializeFromOptions_Internal() at /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:449
>> [48]PETSC ERROR: #5 PetscInitialize_Common() at /global/u2/m/madams/petsc/src/sys/objects/pinit.c:963
>> [48]PETSC ERROR: #6 PetscInitialize() at /global/u2/m/madams/petsc/src/sys/objects/pinit.c:1238
>> 
>> 
>> On Sun, Jan 23, 2022 at 8:58 AM Mark Adams <mfad...@lbl.gov> wrote:
>> 
>> 
>> On Sat, Jan 22, 2022 at 6:22 PM Barry Smith <bsm...@petsc.dev> wrote:
>> 
>>    I cleaned up Mark's last run and put it in a fixed-width font. I realize 
>> this may be too difficult but it would be great to have identical runs to 
>> compare with on Summit.
>> 
>> I was planning on running this on Perlmutter today, as well as some sanity 
>> checks, like verifying that all GPUs are being used. I'll try PetscDeviceView.
>> 
>> Junchao modified the timers and all GPU > CPU now, but he seemed to move the 
>> timers further out, and Barry wants them tight around the "kernel".
>> I think Junchao is going to work on that, so I will hold off.
>> (I removed the Kokkos wait stuff and it seemed to run a little faster, but I 
>> am not sure how deterministic the timers are; I did a test with GAMG and it 
>> was fine.)
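
    A rough sketch of what "tight on the kernel" could look like with a 
user-level log event, where Begin/End wrap only the call of interest; the 
event name and the VecAXPY being timed are just illustrative, not what is in 
any branch, and for GPU vector types one presumably also needs a device 
synchronization before the EventEnd for the time to reflect the kernel rather 
than just the launch:

    #include <petscvec.h>

    /* Hedged sketch: register a custom event and time one operation tightly. */
    int main(int argc, char **argv)
    {
      Vec            x, y;
      PetscLogEvent  TIGHT_AXPY;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
      ierr = PetscLogEventRegister("TightAXPY", VEC_CLASSID, &TIGHT_AXPY); CHKERRQ(ierr);

      ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, 1000000, &x); CHKERRQ(ierr);
      ierr = VecDuplicate(x, &y); CHKERRQ(ierr);
      ierr = VecSet(x, 1.0); CHKERRQ(ierr);
      ierr = VecSet(y, 2.0); CHKERRQ(ierr);

      /* the event brackets only the kernel, nothing else */
      ierr = PetscLogEventBegin(TIGHT_AXPY, 0, 0, 0, 0); CHKERRQ(ierr);
      ierr = VecAXPY(y, 1.0, x); CHKERRQ(ierr);
      ierr = PetscLogEventEnd(TIGHT_AXPY, 0, 0, 0, 0); CHKERRQ(ierr);

      ierr = VecDestroy(&x); CHKERRQ(ierr);
      ierr = VecDestroy(&y); CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }

    Run with -log_view and the "TightAXPY" event appears alongside the built-in 
events.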
>> 
>> 
>> 
>>    As Jed noted, Scatter takes a long time but the pack and unpack take no 
>> time? Are these not timed when using Kokkos?
>> 
>> 
>> --- Event Stage 2: KSP Solve only
>> 
>> MatMult              400 1.0 8.8003e+00 1.1 1.06e+11 1.0 2.2e+04 8.5e+04 0.0e+00  2 55 61 54  0  70 91100100   95,058   132,242      0 0.00e+00    0 0.00e+00 100
>> VecScatterBegin      400 1.0 1.3391e+00 2.6 0.00e+00 0.0 2.2e+04 8.5e+04 0.0e+00  0  0 61 54  0   7  0100100        0         0      0 0.00e+00    0 0.00e+00  0
>> VecScatterEnd        400 1.0 1.3240e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   9  0  0  0        0         0      0 0.00e+00    0 0.00e+00  0
>> SFPack               400 1.0 1.8276e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0        0         0      0 0.00e+00    0 0.00e+00  0
>> SFUnpack             400 1.0 6.2653e-05 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0        0         0      0 0.00e+00    0 0.00e+00  0
>> 
>> KSPSolve               2 1.0 1.2540e+01 1.0 1.17e+11 1.0 2.2e+04 8.5e+04 1.2e+03  3 60 61 54 60 100100100      73,592   116,796      0 0.00e+00    0 0.00e+00 100
>> VecTDot              802 1.0 1.3551e+00 1.2 3.36e+09 1.0 0.0e+00 0.0e+00 8.0e+02  0  2  0  0 40  10  3  0      19,627    52,599      0 0.00e+00    0 0.00e+00 100
>> VecNorm              402 1.0 9.0151e-01 2.2 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02  0  1  0  0 20   5  1  0  0   14,788   125,477      0 0.00e+00    0 0.00e+00 100
>> VecAXPY              800 1.0 8.2617e-01 1.0 3.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   7  3  0  0   32,112    61,644      0 0.00e+00    0 0.00e+00 100
>> VecAYPX              398 1.0 8.1525e-01 1.6 1.67e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   5  1  0  0   16,190    20,689      0 0.00e+00    0 0.00e+00 100
>> VecPointwiseMult     402 1.0 3.5694e-01 1.0 8.43e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  1  0  0   18,675    38,633      0 0.00e+00    0 0.00e+00 100
>> 
>> 
>> 
>>> On Jan 22, 2022, at 12:40 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>> 
>>> And I have a new MR if you want to see what I've done so far.
>> 
>> <jac_out_001_kokkos_Crusher_6_1_notpl.txt><jac_out_001_kokkos_Perlmutter_6_1.txt>
> 
> <jac_out_001_kokkos_Crusher_6_1_notpl.txt>
