Re: [petsc-dev] Kokkos/Crusher performance

2022-01-25 Thread Barry Smith

  So the MPI is killing you in going from 8 to 64. (The GPU flop rate scales 
almost perfectly, but the overall flop rate is only half of what it should be 
at 64).
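
  For example, from the 6-level KSPSolve lines Mark quotes below, the ratio of the 
overall flop rate to the GPU-only flop rate drops from about 0.68 at 8 nodes to 
about 0.56 at 64 nodes; a rough back-of-the-envelope sketch (numbers copied from 
those logs):

#include <stdio.h>

int main(void)
{
  /* {nodes, overall Mflop/s, GPU-only Mflop/s} for the 6-level KSPSolve runs */
  const double runs[2][3] = {{8, 3798162, 5557106}, {64, 24130606, 43326249}};
  for (int i = 0; i < 2; i++)
    printf("%3.0f nodes: overall/GPU flop-rate ratio = %.2f\n",
           runs[i][0], runs[i][1] / runs[i][2]);
  return 0; /* prints ~0.68 at 8 nodes and ~0.56 at 64 nodes */
}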

> On Jan 25, 2022, at 9:24 PM, Mark Adams  wrote:
> 
> It looks like we have our instrumentation and job configuration in decent 
> shape so on to scaling with AMG.
> Using multiple nodes I got errors about table entries not found, which can 
> be caused by a buggy MPI; the problem does go away when I turn GPU-aware 
> MPI off.
> Jed's analysis, if I have this right, is that at 0.7 Tflops we are at about 
> 35% of theoretical peak wrt memory bandwidth.
> I run out of memory with the next step in this study (7 levels of 
> refinement), with 2M equations per GPU. This seems low to me and we will see 
> if we can fix this.
> So this 0.7 Tflops is with only 1/4M equations, so 35% is not terrible.
> Here are the solve times with 001, 008 and 064 nodes, and 5 or 6 levels of 
> refinement.
> 
> out_001_kokkos_Crusher_5_1.txt:KSPSolve  10 1.0 1.2933e+00 1.0 
> 4.13e+10 1.1 1.8e+05 8.4e+03 5.8e+02  3 87 86 78 48 100100100100100 248792   
> 423857   6840 3.85e+02 6792 3.85e+02 100
> out_001_kokkos_Crusher_6_1.txt:KSPSolve  10 1.0 5.3667e+00 1.0 
> 3.89e+11 1.0 2.1e+05 3.3e+04 6.7e+02  2 87 86 79 48 100100100100100 571572   
> 72   7920 1.74e+03 7920 1.74e+03 100
> out_008_kokkos_Crusher_5_1.txt:KSPSolve  10 1.0 1.9407e+00 1.0 
> 4.94e+10 1.1 3.5e+06 6.2e+03 6.7e+02  5 87 86 79 47 100100100100100 1581096   
> 3034723   7920 6.88e+02 7920 6.88e+02 100
> out_008_kokkos_Crusher_6_1.txt:KSPSolve  10 1.0 7.4478e+00 1.0 
> 4.49e+11 1.0 4.1e+06 2.3e+04 7.6e+02  2 88 87 80 49 100100100100100 3798162   
> 5557106   9367 3.02e+03 9359 3.02e+03 100
> out_064_kokkos_Crusher_5_1.txt:KSPSolve  10 1.0 2.4551e+00 1.0 
> 5.40e+10 1.1 4.2e+07 5.4e+03 7.3e+02  5 88 87 80 47 100100100100100 11065887  
>  23792978   8684 8.90e+02 8683 8.90e+02 100
> out_064_kokkos_Crusher_6_1.txt:KSPSolve  10 1.0 1.1335e+01 1.0 
> 5.38e+11 1.0 5.4e+07 2.0e+04 9.1e+02  4 88 88 82 49 100100100100100 24130606  
>  43326249   11249 4.26e+03 11249 4.26e+03 100
> 
> On Tue, Jan 25, 2022 at 1:49 PM Mark Adams  > wrote:
> 
> Note that Mark's logs have been switching back and forth between 
> -use_gpu_aware_mpi and changing number of ranks -- we won't have that 
> information if we do manual timing hacks. This is going to be a routine thing 
> we'll need on the mailing list and we need the provenance to go with it.
> 
> GPU-aware MPI crashes sometimes, so to be safe while debugging I had it off. 
> It works fine here, so it has been on in the last tests.
> Here is a comparison.
>  
> 



Re: [petsc-dev] Kokkos/Crusher performance

2022-01-25 Thread Mark Adams
>
>
> Note that Mark's logs have been switching back and forth between
> -use_gpu_aware_mpi and changing number of ranks -- we won't have that
> information if we do manual timing hacks. This is going to be a routine
> thing we'll need on the mailing list and we need the provenance to go with
> it.
>

GPU-aware MPI crashes sometimes, so to be safe while debugging I had it
off. It works fine here, so it has been on in the last tests.
Here is a comparison.
Script started on 2022-01-25 13:44:31-05:00 [TERM="xterm-256color" 
TTY="/dev/pts/0" COLUMNS="296" LINES="100"]
13:44 adams/aijkokkos-gpu-logging *= 
crusher:/gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
 bash -x run_crusher_jac.sbatch
+ '[' -z '' ']'
+ case "$-" in
+ __lmod_vx=x
+ '[' -n x ']'
+ set +x
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for this output 
(/usr/share/lmod/lmod/init/bash)
Shell debugging restarted
+ unset __lmod_vx
+ NG=8
+ NC=1
+ date
Tue 25 Jan 2022 01:44:38 PM EST
+ EXTRA='-dm_view -log_viewx -ksp_view -use_gpu_aware_mpi true'
+ HYPRE_EXTRA='-pc_hypre_boomeramg_relax_type_all l1scaled-Jacobi 
-pc_hypre_boomeramg_interp_type ext+i -pc_hypre_boomeramg_coarsen_type PMIS 
-pc_hypre_boomeramg_no_CF'
+ HYPRE_EXTRA='-pc_hypre_boomeramg_no_CF true 
-pc_hypre_boomeramg_strong_threshold 0.75 -pc_hypre_boomeramg_agg_nl 1 
-pc_hypre_boomeramg_coarsen_type HMIS -pc_hypre_boomeramg_interp_type ext+i '
+ for REFINE in 5
+ for NPIDX in 1
+ let 'N1 = 1 * 1'
++ bc -l
+ PG=2.
++ printf %.0f 2.
+ PG=2
+ let 'NCC = 8 / 1'
+ let 'N4 = 2 * 1'
+ let 'NODES = 1 * 1 * 1'
+ let 'N = 1 * 1 * 8'
+ echo n= 8 ' NODES=' 1 ' NC=' 1 ' PG=' 2
n= 8  NODES= 1  NC= 1  PG= 2
++ printf %03d 1
+ foo=001
+ srun -n8 -N1 --ntasks-per-gpu=1 --gpu-bind=closest -c 8 ../ex13 
-dm_plex_box_faces 2,2,2 -petscpartitioner_simple_process_grid 2,2,2 
-dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1 -dm_refine 5 
-dm_view -log_viewx -ksp_view -use_gpu_aware_mpi true -dm_mat_type aijkokkos 
-dm_vec_type kokkos -pc_type jacobi
+ tee jac_out_001_kokkos_Crusher_5_1_noview.txt
DM Object: box 8 MPI processes
  type: plex
box in 3 dimensions:
  Number of 0-cells per rank: 35937 35937 35937 35937 35937 35937 35937 35937
  Number of 1-cells per rank: 104544 104544 104544 104544 104544 104544 104544 
104544
  Number of 2-cells per rank: 101376 101376 101376 101376 101376 101376 101376 
101376
  Number of 3-cells per rank: 32768 32768 32768 32768 32768 32768 32768 32768
Labels:
  celltype: 4 strata with value/size (0 (35937), 1 (104544), 4 (101376), 7 
(32768))
  depth: 4 strata with value/size (0 (35937), 1 (104544), 2 (101376), 3 (32768))
  marker: 1 strata with value/size (1 (12474))
  Face Sets: 3 strata with value/size (1 (3969), 3 (3969), 6 (3969))
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=2048383, cols=2048383
total: nonzeros=127263527, allocated nonzeros=127263527
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=2048383, cols=2048383
total: nonzeros=127263527, allocated nonzeros=127263527
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=2048383, cols=2048383
total: nonzeros=127263527, allocated nonzeros=127263527
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
Solve time: 0.34211
#PETSc Option Table entries:
-benchmark_it 2
-dm_distribute
-dm_mat_type aijkokkos
-dm_plex_box_faces 2,2,

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-25 Thread Mark Adams
Here are two runs, without and with -log_view, respectively.
My new timer is "Solve time = ".
There is about a 10% difference.

On Tue, Jan 25, 2022 at 12:53 PM Mark Adams  wrote:

> BTW, a -device_view would be great.
>
> On Tue, Jan 25, 2022 at 12:30 PM Mark Adams  wrote:
>
>>
>>
>> On Tue, Jan 25, 2022 at 11:56 AM Jed Brown  wrote:
>>
>>> Barry Smith  writes:
>>>
>>> >   Thanks Mark, far more interesting. I've improved the formatting to
>>> make it easier to read (and fixed width font for email reading)
>>> >
>>> >   * Can you do same run with say 10 iterations of Jacobi PC?
>>> >
>>> >   * PCApply performance (looks like GAMG) is terrible! Problems too
>>> small?
>>>
>>> This is -pc_type jacobi.
>>>
>>> >   * VecScatter time is completely dominated by SFPack! Junchao what's
>>> up with that? Lots of little kernels in the PCApply? PCJACOBI run will help
>>> clarify where that is coming from.
>>>
>>> It's all in MatMult.
>>>
>>> I'd like to see a run that doesn't wait for the GPU.
>>>
>>>
>> Not sure what you mean. Can I do that?
>>
>>
>
Script started on 2022-01-25 13:33:45-05:00 [TERM="xterm-256color" 
TTY="/dev/pts/0" COLUMNS="296" LINES="100"]
13:33 adams/aijkokkos-gpu-logging *= 
crusher:/gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
 bash -x run_crusher_jac.sbatch
+ '[' -z '' ']'
+ case "$-" in
+ __lmod_vx=x
+ '[' -n x ']'
+ set +x
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for this output 
(/usr/share/lmod/lmod/init/bash)
Shell debugging restarted
+ unset __lmod_vx
+ NG=8
+ NC=1
+ date
Tue 25 Jan 2022 01:33:53 PM EST
+ EXTRA='-dm_view -log_viewx -ksp_view -use_gpu_aware_mpi true'
+ HYPRE_EXTRA='-pc_hypre_boomeramg_relax_type_all l1scaled-Jacobi 
-pc_hypre_boomeramg_interp_type ext+i -pc_hypre_boomeramg_coarsen_type PMIS 
-pc_hypre_boomeramg_no_CF'
+ HYPRE_EXTRA='-pc_hypre_boomeramg_no_CF true 
-pc_hypre_boomeramg_strong_threshold 0.75 -pc_hypre_boomeramg_agg_nl 1 
-pc_hypre_boomeramg_coarsen_type HMIS -pc_hypre_boomeramg_interp_type ext+i '
+ for REFINE in 5
+ for NPIDX in 1
+ let 'N1 = 1 * 1'
++ bc -l
+ PG=2.
++ printf %.0f 2.
+ PG=2
+ let 'NCC = 8 / 1'
+ let 'N4 = 2 * 1'
+ let 'NODES = 1 * 1 * 1'
+ let 'N = 1 * 1 * 8'
+ echo n= 8 ' NODES=' 1 ' NC=' 1 ' PG=' 2
n= 8  NODES= 1  NC= 1  PG= 2
++ printf %03d 1
+ foo=001
+ srun -n8 -N1 --ntasks-per-gpu=1 --gpu-bind=closest -c 8 ../ex13 
-dm_plex_box_faces 2,2,2 -petscpartitioner_simple_process_grid 2,2,2 
-dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1 -dm_refine 5 
-dm_view -log_viewx -ksp_view -use_gpu_aware_mpi true -dm_mat_type aijkokkos 
-dm_vec_type kokkos -pc_type jacobi
+ tee jac_out_001_kokkos_Crusher_5_1_noview.txt
DM Object: box 8 MPI processes
  type: plex
box in 3 dimensions:
  Number of 0-cells per rank: 35937 35937 35937 35937 35937 35937 35937 35937
  Number of 1-cells per rank: 104544 104544 104544 104544 104544 104544 104544 
104544
  Number of 2-cells per rank: 101376 101376 101376 101376 101376 101376 101376 
101376
  Number of 3-cells per rank: 32768 32768 32768 32768 32768 32768 32768 32768
Labels:
  celltype: 4 strata with value/size (0 (35937), 1 (104544), 4 (101376), 7 
(32768))
  depth: 4 strata with value/size (0 (35937), 1 (104544), 2 (101376), 3 (32768))
  marker: 1 strata with value/size (1 (12474))
  Face Sets: 3 strata with value/size (1 (3969), 3 (3969), 6 (3969))
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=2048383, cols=2048383
total: nonzeros=127263527, allocated nonzeros=127263527
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=2048383, cols=2048383
total: nonzeros=127263527, allocated nonzeros=127263527
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED no

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-25 Thread Jed Brown
Barry Smith  writes:

>> What is the command line option to turn 
>> PetscLogGpuTimeBegin/PetscLogGpuTimeEnd into a no-op even when -log_view is 
>> on? I know it'll mess up attribution, but it'll still tell us how long the 
>> solve took.
>
>   We don't have an API for this yet. It is slightly tricky because turning it 
> off will also break the regular -log_view for some stuff like VecAXPY(), i.e., 
> anything that doesn't have a needed synchronization with the CPU.

Of course it will misattribute time, but the high-level timing (like KSPSolve) is 
still useful. We need an option for this so we can still have -log_view output.
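
Something like a runtime guard would be enough; a rough sketch, assuming a 
hypothetical flag (only PetscLogGpuTimeBegin/End below are real PETSc calls, the 
flag and wrappers are made up for illustration):

#include <petsclog.h>

/* Hypothetical guard: with the flag off, -log_view still runs but without the
   extra GPU synchronizations (and with the misattribution noted above). */
static PetscBool LogGpuTimeEnabled = PETSC_TRUE;

static PetscErrorCode GuardedLogGpuTimeBegin(void)
{
  PetscErrorCode ierr;
  PetscFunctionBegin;
  if (LogGpuTimeEnabled) { ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr); }
  PetscFunctionReturn(0);
}

static PetscErrorCode GuardedLogGpuTimeEnd(void)
{
  PetscErrorCode ierr;
  PetscFunctionBegin;
  if (LogGpuTimeEnabled) { ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr); }
  PetscFunctionReturn(0);
}

The flag could be set at startup with PetscOptionsGetBool().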

Note that Mark's logs have been switching back and forth between 
-use_gpu_aware_mpi and changing number of ranks -- we won't have that 
information if we do manual timing hacks. This is going to be a routine thing 
we'll need on the mailing list and we need the provenance to go with it.


Re: [petsc-dev] Kokkos/Crusher performance

2022-01-25 Thread Mark Adams
BTW, a -device_view would be great.

On Tue, Jan 25, 2022 at 12:30 PM Mark Adams  wrote:

>
>
> On Tue, Jan 25, 2022 at 11:56 AM Jed Brown  wrote:
>
>> Barry Smith  writes:
>>
>> >   Thanks Mark, far more interesting. I've improved the formatting to
>> make it easier to read (and fixed width font for email reading)
>> >
>> >   * Can you do same run with say 10 iterations of Jacobi PC?
>> >
>> >   * PCApply performance (looks like GAMG) is terrible! Problems too
>> small?
>>
>> This is -pc_type jacobi.
>>
>> >   * VecScatter time is completely dominated by SFPack! Junchao what's
>> up with that? Lots of little kernels in the PCApply? PCJACOBI run will help
>> clarify where that is coming from.
>>
>> It's all in MatMult.
>>
>> I'd like to see a run that doesn't wait for the GPU.
>>
>>
> Not sure what you mean. Can I do that?
>
>


Re: [petsc-dev] Kokkos/Crusher performance

2022-01-25 Thread Barry Smith



> On Jan 25, 2022, at 12:25 PM, Jed Brown  wrote:
> 
> Barry Smith  writes:
> 
>>> On Jan 25, 2022, at 11:55 AM, Jed Brown  wrote:
>>> 
>>> Barry Smith  writes:
>>> 
 Thanks Mark, far more interesting. I've improved the formatting to make it 
 easier to read (and fixed width font for email reading)
 
 * Can you do same run with say 10 iterations of Jacobi PC?
 
 * PCApply performance (looks like GAMG) is terrible! Problems too small?
>>> 
>>> This is -pc_type jacobi.
>> 
>>  Dang, how come it doesn't warn about all the gamg arguments passed to the 
>> program? I saw them and jumped to the wrong conclusion.
> 
> We don't have -options_left by default. Mark has a big .petscrc or 
> PETSC_OPTIONS.
> 
>>  How come PCApply is so low while Pointwise mult (which should be all of 
>> PCApply) is high?
> 
> I also think that's weird.
> 
>>> 
 * VecScatter time is completely dominated by SFPack! Junchao what's up 
 with that? Lots of little kernels in the PCApply? PCJACOBI run will help 
 clarify where that is coming from.
>>> 
>>> It's all in MatMult.
>>> 
>>> I'd like to see a run that doesn't wait for the GPU.
>> 
>>  Indeed
> 
> What is the command line option to turn 
> PetscLogGpuTimeBegin/PetscLogGpuTimeEnd into a no-op even when -log_view is 
> on? I know it'll mess up attribution, but it'll still tell us how long the 
> solve took.

  We don't have an API for this yet. It is slightly tricky because turning it 
off will also break the regular -log_view for some stuff like VecAXPY(), i.e., 
anything that doesn't have a needed synchronization with the CPU.

  Because of this I think Mark should just put a PetscTime() around KSPSolve, 
run without -log_view, and compare that number to the one from -log_view to see 
how much overhead the synchronous PetscLogGpuTime is causing. Ad hoc, yes, but a 
quick and easy way to get the information.
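
  A minimal sketch of what I mean, assuming the usual ierr/CHKERRQ style and that 
ksp, b, x are already set up:

#include <petscksp.h>
#include <petsctime.h>

/* Time KSPSolve with PetscTime() directly, independent of -log_view, so the
   number can be compared with the KSPSolve time that -log_view reports. */
PetscErrorCode TimedSolve(KSP ksp, Vec b, Vec x)
{
  PetscErrorCode ierr;
  PetscLogDouble t0, t1;

  PetscFunctionBegin;
  ierr = PetscTime(&t0);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = PetscTime(&t1);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "Solve time: %g\n", (double)(t1 - t0));CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

Run it twice, with and without -log_view, and compare the two numbers.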

> 
> Also, can we make WaitForKokkos a no-op? I don't think it's necessary for 
> correctness (docs indicate kokkos::fence synchronizes).



Re: [petsc-dev] Kokkos/Crusher performance

2022-01-25 Thread Mark Adams
On Tue, Jan 25, 2022 at 11:56 AM Jed Brown  wrote:

> Barry Smith  writes:
>
> >   Thanks Mark, far more interesting. I've improved the formatting to
> make it easier to read (and fixed width font for email reading)
> >
> >   * Can you do same run with say 10 iterations of Jacobi PC?
> >
> >   * PCApply performance (looks like GAMG) is terrible! Problems too
> small?
>
> This is -pc_type jacobi.
>
> >   * VecScatter time is completely dominated by SFPack! Junchao what's up
> with that? Lots of little kernels in the PCApply? PCJACOBI run will help
> clarify where that is coming from.
>
> It's all in MatMult.
>
> I'd like to see a run that doesn't wait for the GPU.
>
>
Not sure what you mean. Can I do that?


Re: [petsc-dev] Kokkos/Crusher performance

2022-01-25 Thread Jed Brown
Barry Smith  writes:

>> On Jan 25, 2022, at 11:55 AM, Jed Brown  wrote:
>> 
>> Barry Smith  writes:
>> 
>>>  Thanks Mark, far more interesting. I've improved the formatting to make it 
>>> easier to read (and fixed width font for email reading)
>>> 
>>>  * Can you do same run with say 10 iterations of Jacobi PC?
>>> 
>>>  * PCApply performance (looks like GAMG) is terrible! Problems too small?
>> 
>> This is -pc_type jacobi.
>
>   Dang, how come it doesn't warn about all the gamg arguments passed to the 
> program? I saw them and jumped to the wrong conclusion.

We don't have -options_left by default. Mark has a big .petscrc or 
PETSC_OPTIONS.

>   How come PCApply is so low while Pointwise mult (which should be all of 
> PCApply) is high?

I also think that's weird.

>> 
>>>  * VecScatter time is completely dominated by SFPack! Junchao what's up 
>>> with that? Lots of little kernels in the PCApply? PCJACOBI run will help 
>>> clarify where that is coming from.
>> 
>> It's all in MatMult.
>> 
>> I'd like to see a run that doesn't wait for the GPU.
>
>   Indeed

What is the command line option to turn PetscLogGpuTimeBegin/PetscLogGpuTimeEnd 
into a no-op even when -log_view is on? I know it'll mess up attribution, but 
it'll still tell us how long the solve took.

Also, can we make WaitForKokkos a no-op? I don't think it's necessary for 
correctness (docs indicate kokkos::fence synchronizes).


Re: [petsc-dev] Kokkos/Crusher performance

2022-01-25 Thread Barry Smith



> On Jan 25, 2022, at 11:55 AM, Jed Brown  wrote:
> 
> Barry Smith  writes:
> 
>>  Thanks Mark, far more interesting. I've improved the formatting to make it 
>> easier to read (and fixed width font for email reading)
>> 
>>  * Can you do same run with say 10 iterations of Jacobi PC?
>> 
>>  * PCApply performance (looks like GAMG) is terrible! Problems too small?
> 
> This is -pc_type jacobi.

  Dang, how come it doesn't warn about all the gamg arguments passed to the 
program? I saw them and jumped to the wrong conclusion.

  How come PCApply is so low while Pointwise mult (which should be all of 
PCApply) is high?

  
> 
>>  * VecScatter time is completely dominated by SFPack! Junchao what's up with 
>> that? Lots of little kernels in the PCApply? PCJACOBI run will help clarify 
>> where that is coming from.
> 
> It's all in MatMult.
> 
> I'd like to see a run that doesn't wait for the GPU.

  Indeed

> 
>> 
>> EventCount  Time (sec) Flop  
>> --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu 
>> - GPU
>>   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count  
>>  Size  %F
>> ---
>> 
>> MatMult  200 1.0 6.7831e-01 1.0 4.91e+10 1.0 1.1e+04 6.6e+04 
>> 1.0e+00  9 92 99 79  0  71 92100100  0 579,635  1,014,212  1 2.04e-04
>> 0 0.00e+00 100
>> KSPSolve   1 1.0 9.4550e-01 1.0 5.31e+10 1.0 1.1e+04 6.6e+04 
>> 6.0e+02 12100 99 79 94 100100100100100 449,667893,741  1 2.04e-04
>> 0 0.00e+00 100
>> PCApply  201 1.0 1.6966e-01 1.0 3.09e+08 1.0 0.0e+00 0.0e+00 
>> 2.0e+00  2  1  0  0  0  18  1  0  0  0  14,55816,3941  0 0.00e+00
>> 0 0.00e+00 100
>> VecTDot  401 1.0 5.3642e-02 1.3 1.23e+09 1.0 0.0e+00 0.0e+00 
>> 4.0e+02  1  2  0  0 62   5  2  0  0 66 183,716353,914  0 0.00e+00
>> 0 0.00e+00 100
>> VecNorm  201 1.0 2.2219e-02 1.1 6.17e+08 1.0 0.0e+00 0.0e+00 
>> 2.0e+02  0  1  0  0 31   2  1  0  0 33 222,325303,155  0 0.00e+00
>> 0 0.00e+00 100
>> VecAXPY  400 1.0 2.3017e-02 1.1 1.23e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  2  0  0  0   2  2  0  0  0 427,091514,744  0 0.00e+00
>> 0 0.00e+00 100
>> VecAYPX  199 1.0 1.1312e-02 1.1 6.11e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  1  0  0  0   1  1  0  0  0 432,323532,889  0 0.00e+00
>> 0 0.00e+00 100
>> VecPointwiseMult 201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  1  0  0  0   1  1  0  0  0 235,882290,088  0 0.00e+00
>> 0 0.00e+00 100
>> VecScatterBegin  200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04 
>> 1.0e+00  2  0 99 79  0  19  0100100  0   0  0  1 2.04e-04
>> 0 0.00e+00  0
>> VecScatterEnd200 1.0 1.9007e-02 3.7 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   1  0  0  0  0   0  0  0 0.00e+00
>> 0 0.00e+00  0
>> SFPack   200 1.0 1.7309e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  0  0  0  0  18  0  0  0  0   0  0  1 2.04e-04
>> 0 0.00e+00  0
>> SFUnpack 200 1.0 2.3165e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0   0  0  0 0.00e+00
>> 0 0.00e+00  0
>> 
>> 
>>> On Jan 25, 2022, at 8:29 AM, Mark Adams  wrote:
>>> 
>>> adding Suyash,
>>> 
>>> I found the/a problem. Using ex56, which has a crappy decomposition, using 
>>> one MPI process/GPU is much faster than using 8 (64 total). (I am looking 
>>> at ex13 to see how much of this is due to the decomposition)
>>> If you only use 8 processes it seems that all 8 are put on the first GPU, 
>>> but adding -c8 seems to fix this.
>>> Now the numbers are looking reasonable.
>>> 
>>> On Mon, Jan 24, 2022 at 3:24 PM Barry Smith >> > wrote:
>>> 
>>>  For this, to start, someone can run 
>>> 
>>> src/vec/vec/tutorials/performance.c 
>>> 
>>> and compare the performance to that in the technical report Evaluation of 
>>> PETSc on a Heterogeneous Architecture \\ the OLCF Summit System \\ Part I: 
>>> Vector Node Performance. Google to find. One does not have to and shouldn't 
>>> do an extensive study right now that compares everything, instead one 
>>> should run a very small number of different size problems (make them big) 
>>> and compare those sizes with what Summit gives. Note you will need to make 
>>> sure that performance.c uses the Kokkos backend.
>>> 
>>>  One hopes for better performance than Summit; if one gets tons worse we 
>>> know something is very wrong somewhere. I'd love to see some c

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-25 Thread Jed Brown
Barry Smith  writes:

>   Thanks Mark, far more interesting. I've improved the formatting to make it 
> easier to read (and fixed width font for email reading)
>
>   * Can you do same run with say 10 iterations of Jacobi PC?
>
>   * PCApply performance (looks like GAMG) is terrible! Problems too small?

This is -pc_type jacobi.

>   * VecScatter time is completely dominated by SFPack! Junchao what's up with 
> that? Lots of little kernels in the PCApply? PCJACOBI run will help clarify 
> where that is coming from.

It's all in MatMult.

I'd like to see a run that doesn't wait for the GPU.

> 
> EventCount  Time (sec) Flop   
>--- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - 
> GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   
> Size  %F
> ---
>
> MatMult  200 1.0 6.7831e-01 1.0 4.91e+10 1.0 1.1e+04 6.6e+04 
> 1.0e+00  9 92 99 79  0  71 92100100  0 579,635  1,014,212  1 2.04e-04
> 0 0.00e+00 100
> KSPSolve   1 1.0 9.4550e-01 1.0 5.31e+10 1.0 1.1e+04 6.6e+04 
> 6.0e+02 12100 99 79 94 100100100100100 449,667893,741  1 2.04e-04
> 0 0.00e+00 100
> PCApply  201 1.0 1.6966e-01 1.0 3.09e+08 1.0 0.0e+00 0.0e+00 
> 2.0e+00  2  1  0  0  0  18  1  0  0  0  14,55816,3941  0 0.00e+00
> 0 0.00e+00 100
> VecTDot  401 1.0 5.3642e-02 1.3 1.23e+09 1.0 0.0e+00 0.0e+00 
> 4.0e+02  1  2  0  0 62   5  2  0  0 66 183,716353,914  0 0.00e+00
> 0 0.00e+00 100
> VecNorm  201 1.0 2.2219e-02 1.1 6.17e+08 1.0 0.0e+00 0.0e+00 
> 2.0e+02  0  1  0  0 31   2  1  0  0 33 222,325303,155  0 0.00e+00
> 0 0.00e+00 100
> VecAXPY  400 1.0 2.3017e-02 1.1 1.23e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  2  0  0  0   2  2  0  0  0 427,091514,744  0 0.00e+00
> 0 0.00e+00 100
> VecAYPX  199 1.0 1.1312e-02 1.1 6.11e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  1  0  0  0   1  1  0  0  0 432,323532,889  0 0.00e+00
> 0 0.00e+00 100
> VecPointwiseMult 201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  1  0  0  0   1  1  0  0  0 235,882290,088  0 0.00e+00
> 0 0.00e+00 100
> VecScatterBegin  200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04 
> 1.0e+00  2  0 99 79  0  19  0100100  0   0  0  1 2.04e-04
> 0 0.00e+00  0
> VecScatterEnd200 1.0 1.9007e-02 3.7 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   1  0  0  0  0   0  0  0 0.00e+00
> 0 0.00e+00  0
> SFPack   200 1.0 1.7309e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  2  0  0  0  0  18  0  0  0  0   0  0  1 2.04e-04
> 0 0.00e+00  0
> SFUnpack 200 1.0 2.3165e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0   0  0  0 0.00e+00
> 0 0.00e+00  0
>
>
>> On Jan 25, 2022, at 8:29 AM, Mark Adams  wrote:
>> 
>> adding Suyash,
>> 
>> I found the/a problem. Using ex56, which has a crappy decomposition, using 
>> one MPI process/GPU is much faster than using 8 (64 total). (I am looking at 
>> ex13 to see how much of this is due to the decomposition)
>> If you only use 8 processes it seems that all 8 are put on the first GPU, 
>> but adding -c8 seems to fix this.
>> Now the numbers are looking reasonable.
>> 
>> On Mon, Jan 24, 2022 at 3:24 PM Barry Smith > > wrote:
>> 
>>   For this, to start, someone can run 
>> 
>> src/vec/vec/tutorials/performance.c 
>> 
>> and compare the performance to that in the technical report Evaluation of 
>> PETSc on a Heterogeneous Architecture \\ the OLCF Summit System \\ Part I: 
>> Vector Node Performance. Google to find. One does not have to and shouldn't 
>> do an extensive study right now that compares everything, instead one should 
>> run a very small number of different size problems (make them big) and 
>> compare those sizes with what Summit gives. Note you will need to make sure 
>> that performance.c uses the Kokkos backend.
>> 
>>   One hopes for better performance than Summit; if one gets tons worse we 
>> know something is very wrong somewhere. I'd love to see some comparisons.
>> 
>>   Barry
>> 
>> 
>>> On Jan 24, 2022, at 3:06 PM, Justin Chang >> > wrote:
>>> 
>>> Also, do you guys have an OLCF liaison? That's actually your better bet if 
>>> you do. 
>>> 
>>> Performance issues with ROCm/Kokkos are pretty common in apps besides just 
>>> PETSc. We have several teams actively working on rectifying this. However, 
>>> I think per

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-25 Thread Barry Smith
  Thanks Mark, far more interesting. I've improved the formatting to make it 
easier to read (and fixed width font for email reading)

  * Can you do same run with say 10 iterations of Jacobi PC?

  * PCApply performance (looks like GAMG) is terrible! Problems too small?

  * VecScatter time is completely dominated by SFPack! Junchao what's up with 
that? Lots of little kernels in the PCApply? PCJACOBI run will help clarify 
where that is coming from.


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---

MatMult  200 1.0 6.7831e-01 1.0 4.91e+10 1.0 1.1e+04 6.6e+04 
1.0e+00  9 92 99 79  0  71 92100100  0 579,635  1,014,212  1 2.04e-040 
0.00e+00 100
KSPSolve   1 1.0 9.4550e-01 1.0 5.31e+10 1.0 1.1e+04 6.6e+04 
6.0e+02 12100 99 79 94 100100100100100 449,667893,741  1 2.04e-040 
0.00e+00 100
PCApply  201 1.0 1.6966e-01 1.0 3.09e+08 1.0 0.0e+00 0.0e+00 
2.0e+00  2  1  0  0  0  18  1  0  0  0  14,55816,3941  0 0.00e+000 
0.00e+00 100
VecTDot  401 1.0 5.3642e-02 1.3 1.23e+09 1.0 0.0e+00 0.0e+00 
4.0e+02  1  2  0  0 62   5  2  0  0 66 183,716353,914  0 0.00e+000 
0.00e+00 100
VecNorm  201 1.0 2.2219e-02 1.1 6.17e+08 1.0 0.0e+00 0.0e+00 
2.0e+02  0  1  0  0 31   2  1  0  0 33 222,325303,155  0 0.00e+000 
0.00e+00 100
VecAXPY  400 1.0 2.3017e-02 1.1 1.23e+09 1.0 0.0e+00 0.0e+00 
0.0e+00  0  2  0  0  0   2  2  0  0  0 427,091514,744  0 0.00e+000 
0.00e+00 100
VecAYPX  199 1.0 1.1312e-02 1.1 6.11e+08 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   1  1  0  0  0 432,323532,889  0 0.00e+000 
0.00e+00 100
VecPointwiseMult 201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   1  1  0  0  0 235,882290,088  0 0.00e+000 
0.00e+00 100
VecScatterBegin  200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04 
1.0e+00  2  0 99 79  0  19  0100100  0   0  0  1 2.04e-040 
0.00e+00  0
VecScatterEnd200 1.0 1.9007e-02 3.7 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   1  0  0  0  0   0  0  0 0.00e+000 
0.00e+00  0
SFPack   200 1.0 1.7309e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0  18  0  0  0  0   0  0  1 2.04e-040 
0.00e+00  0
SFUnpack 200 1.0 2.3165e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0   0  0  0 0.00e+000 
0.00e+00  0


> On Jan 25, 2022, at 8:29 AM, Mark Adams  wrote:
> 
> adding Suyash,
> 
> I found the/a problem. Using ex56, which has a crappy decomposition, using 
> one MPI process/GPU is much faster than using 8 (64 total). (I am looking at 
> ex13 to see how much of this is due to the decomposition)
> If you only use 8 processes it seems that all 8 are put on the first GPU, but 
> adding -c8 seems to fix this.
> Now the numbers are looking reasonable.
> 
> On Mon, Jan 24, 2022 at 3:24 PM Barry Smith  > wrote:
> 
>   For this, to start, someone can run 
> 
> src/vec/vec/tutorials/performance.c 
> 
> and compare the performance to that in the technical report Evaluation of 
> PETSc on a Heterogeneous Architecture \\ the OLCF Summit System \\ Part I: 
> Vector Node Performance. Google to find. One does not have to and shouldn't 
> do an extensive study right now that compares everything, instead one should 
> run a very small number of different size problems (make them big) and 
> compare those sizes with what Summit gives. Note you will need to make sure 
> that performance.c uses the Kokkos backend.
> 
>   One hopes for better performance than Summit; if one gets tons worse we 
> know something is very wrong somewhere. I'd love to see some comparisons.
> 
>   Barry
> 
> 
>> On Jan 24, 2022, at 3:06 PM, Justin Chang > > wrote:
>> 
>> Also, do you guys have an OLCF liaison? That's actually your better bet if 
>> you do. 
>> 
>> Performance issues with ROCm/Kokkos are pretty common in apps besides just 
>> PETSc. We have several teams actively working on rectifying this. However, I 
>> think performance issues can be quicker to identify if we had a more 
>> "official" and reproducible PETSc GPU benchmark, which I've already 
>> expressed to some folks in this thread, and as others already commented on 
>> the difficulty of such a task. Hopefully I will have more t

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-25 Thread Mark Adams
>
>
>
> > VecPointwiseMult 201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  1  0  0  0   1  1  0  0  0 235882   290088  0 0.00e+000
> 0.00e+00 100
> > VecScatterBegin  200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04
> 1.0e+00  2  0 99 79  0  19  0100100  0 0   0  1 2.04e-040
> 0.00e+00  0
> > VecScatterEnd200 1.0 1.9007e-02 3.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   1  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
>
> I'm curious how these change with problem size. (To what extent are we
> latency vs bandwidth limited?)
>
>
I am getting a segv in ex13 now (a Kokkos view in Plex), but will do scaling
tests when I get it going again.
(I am trying to get GAMG scaling for Todd by the 3rd.)



> > SFSetUp1 1.0 1.3015e-03 1.3 0.00e+00 0.0 1.1e+02 1.7e+04
> 1.0e+00  0  0  1  0  0   0  0  1  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> > SFPack   200 1.0 1.7309e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  2  0  0  0  0  18  0  0  0  0 0   0  1 2.04e-040
> 0.00e+00  0
> > SFUnpack 200 1.0 2.3165e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
>


Re: [petsc-dev] Kokkos/Crusher performance

2022-01-25 Thread Jed Brown
Mark Adams  writes:

> adding Suyash,
>
> I found the/a problem. Using ex56, which has a crappy decomposition, using
> one MPI process/GPU is much faster than using 8 (64 total). (I am looking
> at ex13 to see how much of this is due to the decomposition)
> If you only use 8 processes it seems that all 8 are put on the first GPU,
> but adding -c8 seems to fix this.
> Now the numbers are looking reasonable.

Hah, we need -log_view to report the bus ID for each GPU so we don't spend another 
day of mailing list traffic identifying this.

This looks to be 2-3x the performance of Spock.
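
Until then, even a tiny standalone check would do; a rough sketch with plain MPI 
and the HIP runtime (nothing PETSc-specific):

#include <stdio.h>
#include <mpi.h>
#include <hip/hip_runtime.h>

/* Report which GCD each MPI rank actually uses, identified by PCI bus ID, so a
   binding mistake like "all ranks on the first GPU" shows up immediately. */
int main(int argc, char **argv)
{
  int  rank, dev = -1;
  char busid[64] = "unknown";

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (hipGetDevice(&dev) == hipSuccess)
    (void)hipDeviceGetPCIBusId(busid, (int)sizeof(busid), dev);
  printf("rank %d -> device %d (PCI bus %s)\n", rank, dev, busid);
  MPI_Finalize();
  return 0;
}

Run it with the same srun binding flags as the solver job and the rank-to-GPU 
mapping is obvious from the output.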

> 
> EventCount  Time (sec) Flop   
>--- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - 
> GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   
> Size  %F
> ---

[...]

> --- Event Stage 2: Solve
>
> BuildTwoSided  1 1.0 9.1706e-05 1.6 0.00e+00 0.0 5.6e+01 4.0e+00 
> 1.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> MatMult  200 1.0 6.7831e-01 1.0 4.91e+10 1.0 1.1e+04 6.6e+04 
> 1.0e+00  9 92 99 79  0  71 92100100  0 579635   1014212  1 2.04e-040 
> 0.00e+00 100

GPU compute bandwidth of around 6 TB/s is okay, but disappointing that 
communication is so expensive.

> MatView1 1.0 7.8531e-05 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 
> 1.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> KSPSolve   1 1.0 9.4550e-01 1.0 5.31e+10 1.0 1.1e+04 6.6e+04 
> 6.0e+02 12100 99 79 94 100100100100100 449667   893741  1 2.04e-040 
> 0.00e+00 100
> PCApply  201 1.0 1.6966e-01 1.0 3.09e+08 1.0 0.0e+00 0.0e+00 
> 2.0e+00  2  1  0  0  0  18  1  0  0  0 14558   163941  0 0.00e+000 
> 0.00e+00 100
> VecTDot  401 1.0 5.3642e-02 1.3 1.23e+09 1.0 0.0e+00 0.0e+00 
> 4.0e+02  1  2  0  0 62   5  2  0  0 66 183716   353914  0 0.00e+000 
> 0.00e+00 100
> VecNorm  201 1.0 2.2219e-02 1.1 6.17e+08 1.0 0.0e+00 0.0e+00 
> 2.0e+02  0  1  0  0 31   2  1  0  0 33 222325   303155  0 0.00e+000 
> 0.00e+00 100
> VecCopy2 1.0 2.3551e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecSet 1 1.0 9.8740e-05 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecAXPY  400 1.0 2.3017e-02 1.1 1.23e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  2  0  0  0   2  2  0  0  0 427091   514744  0 0.00e+000 
> 0.00e+00 100
> VecAYPX  199 1.0 1.1312e-02 1.1 6.11e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  1  0  0  0   1  1  0  0  0 432323   532889  0 0.00e+000 
> 0.00e+00 100

These two are finally about the same speed, but these numbers imply kernel 
overhead of about 57 µs, because these events do nothing else (e.g., 1.1312e-02 s 
/ 199 VecAYPX calls ≈ 57 µs per call).

> VecPointwiseMult 201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  1  0  0  0   1  1  0  0  0 235882   290088  0 0.00e+000 
> 0.00e+00 100
> VecScatterBegin  200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04 
> 1.0e+00  2  0 99 79  0  19  0100100  0 0   0  1 2.04e-040 
> 0.00e+00  0
> VecScatterEnd200 1.0 1.9007e-02 3.7 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   1  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0

I'm curious how these change with problem size. (To what extent are we latency 
vs bandwidth limited?)

> SFSetUp1 1.0 1.3015e-03 1.3 0.00e+00 0.0 1.1e+02 1.7e+04 
> 1.0e+00  0  0  1  0  0   0  0  1  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> SFPack   200 1.0 1.7309e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  2  0  0  0  0  18  0  0  0  0 0   0  1 2.04e-040 
> 0.00e+00  0
> SFUnpack 200 1.0 2.3165e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0


Re: [petsc-dev] Kokkos/Crusher performance

2022-01-25 Thread Mark Adams
adding Suyash,

I found the/a problem. Using ex56, which has a crappy decomposition, running
one MPI process per GPU is much faster than using 8 per GPU (64 ranks total).
(I am looking at ex13 to see how much of this is due to the decomposition.)
If you only use 8 processes it seems that all 8 are put on the first GPU,
but adding -c8 seems to fix this.
Now the numbers are looking reasonable.

On Mon, Jan 24, 2022 at 3:24 PM Barry Smith  wrote:

>
>   For this, to start, someone can run
>
> src/vec/vec/tutorials/performance.c
>
> and compare the performance to that in the technical report Evaluation of
> PETSc on a Heterogeneous Architecture \\ the OLCF Summit System \\ Part I:
> Vector Node Performance. Google to find. One does not have to and shouldn't
> do an extensive study right now that compares everything, instead one
> should run a very small number of different size problems (make them big)
> and compare those sizes with what Summit gives. Note you will need to make
> sure that performance.c uses the Kokkos backend.
>
>   One hopes for better performance than Summit; if one gets tons worse we
> know something is very wrong somewhere. I'd love to see some comparisons.
>
>   Barry
>
>
> On Jan 24, 2022, at 3:06 PM, Justin Chang  wrote:
>
> Also, do you guys have an OLCF liaison? That's actually your better bet if
> you do.
>
> Performance issues with ROCm/Kokkos are pretty common in apps besides just
> PETSc. We have several teams actively working on rectifying this. However,
> I think performance issues can be quicker to identify if we had a more
> "official" and reproducible PETSc GPU benchmark, which I've already
> expressed to some folks in this thread, and as others already commented on
> the difficulty of such a task. Hopefully I will have more time soon to
> illustrate what I am thinking.
>
> On Mon, Jan 24, 2022 at 1:57 PM Justin Chang  wrote:
>
>> My name has been called.
>>
>> Mark, if you're having issues with Crusher, please contact Veronica
>> Vergara (vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in
>> those emails
>>
>> On Mon, Jan 24, 2022 at 1:49 PM Barry Smith  wrote:
>>
>>>
>>>
>>> On Jan 24, 2022, at 2:46 PM, Mark Adams  wrote:
>>>
>>> Yea, CG/Jacobi is as close to a benchmark code as we could want. I could
>>> run this on one processor to get cleaner numbers.
>>>
>>> Is there a designated ECP technical support contact?
>>>
>>>
>>>Mark, you've forgotten you work for DOE. There isn't a non-ECP
>>> technical support contact.
>>>
>>>But if this is an AMD machine then maybe contact Matt's student
>>> Justin Chang?
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Jan 24, 2022 at 2:18 PM Barry Smith  wrote:
>>>

   I think you should contact the crusher ECP technical support team and
 tell them you are getting dismal performance and ask if you should expect
 better. Don't waste time flogging a dead horse.

 On Jan 24, 2022, at 2:16 PM, Matthew Knepley  wrote:

 On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang 
 wrote:

>
>
> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams  wrote:
>
>>
>>
>> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang <
>> junchao.zh...@gmail.com> wrote:
>>
>>> Mark, I think you can benchmark individual vector operations, and
>>> once we get reasonable profiling results, we can move to solvers etc.
>>>
>>
>> Can you suggest a code to run or are you suggesting making a vector
>> benchmark code?
>>
> Make a vector benchmark code, testing vector operations that would be
> used in your solver.
> Also, we can run MatMult() to see if the profiling result is
> reasonable.
> Only once we get some solid results on basic operations, it is useful
> to run big codes.
>

 So we have to make another throw-away code? Why not just look at the
 vector ops in Mark's actual code?

Matt


>
>>
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams  wrote:
>>>


 On Mon, Jan 24, 2022 at 12:44 PM Barry Smith 
 wrote:

>
>   Here except for VecNorm the GPU is used effectively in that most
> of the time is spent doing real work on the GPU
>
> VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00
> 0.0e+00 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0
> 0.00e+000 0.00e+00 100
>
> Even the dots are very effective; only the VecNorm flop rate over
> the full time is much, much lower than the VecDot. Is that somehow due to
> the use of the GPU or CPU MPI in the allreduce?
>

 The VecNorm GPU rate is relatively high on Crusher and the CPU rate
 is about the same as the other vec ops. I don't know what to make of 
 that.

 But Crusher is clearly not crushing it.

>>>