Re: [petsc-dev] Kokkos/Crusher performance

2022-01-21 Thread Junchao Zhang
On Fri, Jan 21, 2022 at 8:08 PM Barry Smith  wrote:

>
>   Junchao, Mark,
>
>  Some of the logging information is nonsensical: MatMult says all
> flops are done on the GPU (last column), but the GPU flop rate is zero.
>
>  It looks like MatMult_SeqAIJKokkos() is missing
> PetscLogGpuTimeBegin()/End(); in fact, all the operations in
> aijkok.kokkos.cxx seem to be missing it. This might explain the crazy 0 GPU
> flop rate. Can this be fixed ASAP?
>
I will add this profiling temporarily. I may use Kokkos' own profiling APIs
later.
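
As a point of reference, a minimal sketch of the kind of change meant here (assuming the usual PETSc/Kokkos conventions; the function-name suffix and the elided view setup are placeholders, not the actual body in aijkok.kokkos.cxx), showing where the timer calls would go, with a Kokkos profiling region as the alternative mentioned above:

  /* Sketch only; the real routine also builds the Kokkos views and calls the
     SpMV kernel where the elided comment sits. */
  #include <petscmat.h>
  #include <Kokkos_Core.hpp>

  static PetscErrorCode MatMult_SeqAIJKokkos_Sketch(Mat A, Vec x, Vec y)
  {
    PetscErrorCode ierr;

    PetscFunctionBegin;
    Kokkos::Profiling::pushRegion("MatMult_SeqAIJKokkos"); /* optional Kokkos-side marker */
    ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);           /* start timing device work */
    /* ... existing code: get Kokkos views of A, x, y and launch the SpMV kernel ... */
    ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);             /* stop timing device work */
    Kokkos::Profiling::popRegion();
    PetscFunctionReturn(0);
  }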


>
>  Regarding VecOps, it sure looks like the kernel launches are killing
> performance.
>
>But in particular, look at the VecTDot and VecNorm CPU flop
> rates compared to the GPU: they are much lower, which tells me the MPI_Allreduce
> is likely also hurting performance there a great deal. It would be good to
> see a single-MPI-rank job to compare, to see performance without the MPI
> overhead.
>
>
>
>
>
>
>
> On Jan 21, 2022, at 6:41 PM, Mark Adams  wrote:
>
> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian
> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
> MI200?).
> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI
> are similar (mat-vec is a little faster w/o, the total is about the same,
> call it noise)
>
> I found that MatMult was about 3x faster using 8 cores/GPU, that is all 64
> cores on the node, than when using 1 core/GPU. With the same size problem
> of course.
> I was thinking MatMult should be faster with just one MPI process. Oh
> well, worry about that later.
>
> The bigger problem, and I have observed this to some extent with the
> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are
> expensive or crazy expensive.
> You can see from the attached output, and from the times here, that the solve
> is dominated by the non-mat-vec operations:
>
>
> 
>
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----   Total    GPU    - CpuToGpu -   - GpuToCpu -   GPU
>                         Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult  400" jac_out_00*5_8_gpuawaremp*
> MatMult              400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27 91 100 100  0  668874       0      0 0.00e+00    0 0.00e+00 100
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
> KSPSolve               2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100 100 100 100 100  208923 1094405      0 0.00e+00    0 0.00e+00 100
>
> Notes about the flop counters here:
> * MatMult flops are not logged as GPU flops, but something is logged
> nonetheless.
> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are
> at < 1%.
>
> Anyway, not sure how to proceed, but I thought I would share.
> Maybe ask the Kokkos guys if they have looked at Crusher.
>
> Mark
>
>
> 
>
>
>


Re: [petsc-dev] Kokkos/Crusher performance

2022-01-21 Thread Barry Smith

Interesting. Is this with all native Kokkos kernels, or do some Kokkos kernels
use ROCm?

I ask because the VecNorm rate is 4 times higher than VecTDot (I would not
expect that), and VecAXPY is less than 1/4 the performance of VecAYPX (I would
not expect that either).
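
For what it's worth, a back-of-the-envelope bandwidth-bound estimate (my own, assuming both kernels run at the same effective memory bandwidth B and 8-byte reals): VecTDot does 2N flops while streaming 16N bytes, so at most roughly 2N/(16N/B) = B/8 flop/s, while VecNorm does 2N flops while streaming 8N bytes, so roughly B/4 flop/s. That only accounts for a 2x gap, so the observed ~4x (and the VecAXPY/VecAYPX gap) points at launch or synchronization overhead rather than bandwidth.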


MatMult              400 1.0 1.0288e+00 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00  0 54  0  0  0  43 91  0  0  0  98964       0      0 0.00e+00    0 0.00e+00 100
MatView                2 1.0 3.3745e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
KSPSolve               2 1.0 2.3989e+00 1.0 1.12e+11 1.0 0.0e+00 0.0e+00 0.0e+00  1 60  0  0  0 100 100  0  0  0  46887  220,001      0 0.00e+00    0 0.00e+00 100
VecTDot              802 1.0 4.7745e-01 1.0 3.29e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0  20  3  0  0  0   6882   15,426      0 0.00e+00    0 0.00e+00 100
VecNorm              402 1.0 1.1532e-01 1.0 1.65e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   5  1  0  0  0  14281   62,757      0 0.00e+00    0 0.00e+00 100
VecCopy                4 1.0 2.1859e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
VecSet                 4 1.0 2.1910e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
VecAXPY              800 1.0 5.5739e-01 1.0 3.28e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0  23  3  0  0  0   5880   14,666      0 0.00e+00    0 0.00e+00 100
VecAYPX              398 1.0 1.0668e-01 1.0 1.63e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   4  1  0  0  0  15284   71,218      0 0.00e+00    0 0.00e+00 100
VecPointwiseMult     402 1.0 1.0930e-01 1.0 8.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  1  0  0  0   7534   33,579      0 0.00e+00    0 0.00e+00 100
PCApply              402 1.0 1.0940e-01 1.0 8.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  1  0  0  0   7527   33,579      0 0.00e+00    0 0.00e+00 100



> On Jan 21, 2022, at 9:46 PM, Mark Adams  wrote:
> 
> 
>But in particular, look at the VecTDot and VecNorm CPU flop rates
> compared to the GPU: they are much lower, which tells me the MPI_Allreduce is
> likely also hurting performance there a great deal. It would be good to see a
> single-MPI-rank job to compare, to see performance without the MPI overhead.
> 
> Here are two single-processor runs, each with a whole GPU. It's not clear
> whether --ntasks-per-gpu=1 refers to the GPU sockets (4 of them) or the GPUs (8).
>  
> 



Re: [petsc-dev] Kokkos/Crusher performance

2022-01-21 Thread Barry Smith

  Mark,

  Fix the logging before you run more. It will help with seeing the true 
disparity between the MatMult and the vector ops.


> On Jan 21, 2022, at 9:37 PM, Mark Adams  wrote:
> 
> Here is one with 2M / GPU. Getting better.
> 
> On Fri, Jan 21, 2022 at 9:17 PM Barry Smith  > wrote:
> 
>Matt is correct, vectors are way too small.
> 
>BTW: Now would be a good time to run some of the Report I benchmarks on 
> Crusher to get a feel for the kernel launch times and performance on VecOps.
> 
>Also Report 2.
> 
>   Barry
> 
> 
>> On Jan 21, 2022, at 7:58 PM, Matthew Knepley > > wrote:
>> 
>> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams > > wrote:
>> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) 
>> on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it MI200?).
>> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI are 
>> similar (mat-vec is a little faster w/o, the total is about the same, call 
>> it noise)
>> 
>> I found that MatMult was about 3x faster using 8 cores/GPU, that is all 64 
>> cores on the node, than when using 1 core/GPU. With the same size problem of 
>> course.
>> I was thinking MatMult should be faster with just one MPI process. Oh well, 
>> worry about that later.
>> 
>> The bigger problem, and I have observed this to some extent with the Landau 
>> TS/SNES/GPU-solver on the V/A100s, is that the vector operations are 
>> expensive or crazy expensive.
>> You can see (attached) and the times here that the solve is dominated by 
>> not-mat-vec:
>> 
>> 
>> EventCount  Time (sec) Flop  
>> --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu 
>> - GPU
>>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count  
>>  Size  %F
>> ---
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ 
>> grep "MatMult  400" jac_out_00*5_8_gpuawaremp*
>> MatMult  400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 
>> 0.0e+00  1 55 62 54  0  27 91100100  0 668874   0  0 0.00e+000 
>> 0.00e+00 100
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ 
>> grep "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
>> KSPSolve   2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 
>> 1.2e+03  4 60 62 54 61 100100100100100 208923   1094405  0 0.00e+000 
>> 0.00e+00 100
>> 
>> Notes about flop counters here, 
>> * that MatMult flops are not logged as GPU flops but something is logged 
>> nonetheless.
>> * The GPU flop rate is 5x the total flop rate  in KSPSolve :\
>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are 
>> at < 1%.
>> 
>> This looks complicated, so just a single remark:
>> 
>> My understanding of the benchmarking of vector ops led by Hannah was that 
>> you needed to be much
>> bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs I 
>> would think you would be
>> at 10% of peak or something right off the bat at these sizes. Barry, is that 
>> right?
>> 
>>   Thanks,
>> 
>>  Matt
>>  
>> Anyway, not sure how to proceed, but I thought I would share.
>> Maybe ask the Kokkos guys if they have looked at Crusher.
>> 
>> Mark
>> -- 
>> What most experimenters take for granted before they begin their experiments 
>> is infinitely more interesting than any results to which their experiments 
>> lead.
>> -- Norbert Wiener
>> 
>> https://www.cse.buffalo.edu/~knepley/ 
> 
> 



Re: [petsc-dev] Kokkos/Crusher performance

2022-01-21 Thread Mark Adams
>
>
>But in particular, look at the VecTDot and VecNorm CPU flop
> rates compared to the GPU: they are much lower, which tells me the MPI_Allreduce
> is likely also hurting performance there a great deal. It would be good to
> see a single-MPI-rank job to compare, to see performance without the MPI
> overhead.
>

Here are two single-processor runs, each with a whole GPU. It's not clear
whether --ntasks-per-gpu=1 refers to the GPU sockets (4 of them) or the GPUs (8).
DM Object: box 1 MPI processes
  type: plex
box in 3 dimensions:
  Number of 0-cells per rank: 35937
  Number of 1-cells per rank: 104544
  Number of 2-cells per rank: 101376
  Number of 3-cells per rank: 32768
Labels:
  celltype: 4 strata with value/size (0 (35937), 1 (104544), 4 (101376), 7 
(32768))
  depth: 4 strata with value/size (0 (35937), 1 (104544), 2 (101376), 3 (32768))
  marker: 1 strata with value/size (1 (24480))
  Face Sets: 6 strata with value/size (6 (3600), 5 (3600), 3 (3600), 4 (3600), 
1 (3600), 2 (3600))
  Linear solve converged due to CONVERGED_RTOL iterations 122
KSP Object: 1 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 1 MPI processes
type: seqaijkokkos
rows=250047, cols=250047
total: nonzeros=15069223, allocated nonzeros=15069223
total number of mallocs used during MatSetValues calls=0
  not using I-node routines
  Linear solve converged due to CONVERGED_RTOL iterations 122
KSP Object: 1 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 1 MPI processes
type: seqaijkokkos
rows=250047, cols=250047
total: nonzeros=15069223, allocated nonzeros=15069223
total number of mallocs used during MatSetValues calls=0
  not using I-node routines
  Linear solve converged due to CONVERGED_RTOL iterations 122
KSP Object: 1 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 1 MPI processes
type: seqaijkokkos
rows=250047, cols=250047
total: nonzeros=15069223, allocated nonzeros=15069223
total number of mallocs used during MatSetValues calls=0
  not using I-node routines

*** WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r 
-fCourier9' to print this document***


-- PETSc Performance Summary: 
--

/gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a 
arch-olcf-crusher named crusher003 with 1 processor, by adams Fri Jan 21 
21:30:02 2022
Using Petsc Development GIT revision: v3.16.3-665-g1012189b9a  GIT Date: 
2022-01-21 16:28:20 +

 Max   Max/Min Avg   Total
Time (sec):   5.916e+01 1.000   5.916e+01
Objects:  1.637e+03 1.000   1.637e+03
Flop: 1.454e+10 1.000   1.454e+10  1.454e+10
Flop/sec: 2.459e+08 1.000   2.459e+08  2.459e+08
MPI Messages: 0.000e+00 0.000   0.000e+00  0.000e+00
MPI Message Lengths:  1.800e+01 1.000   0.000e+00  1.800e+01
MPI Reductions:   9.000e+00 1.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flop
                          and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   - Time --  - Flop --  --- Messages ---  -- 
Message Lengths --  -- Reductions --
Avg %Total Avg %TotalCount   %Total 
Avg %TotalCount   %Total
 0:  Main Stage: 5.8503e+01  98.9%  6.3978e+09  44.0%  0.000e+00   0.0%  
0.000e+00  100.0%  9.000e+00 100.0%
 1: PCSetUp: 2.0318e-02   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 2:  KSP Solve only: 6.3347e-01   1.1%  8.1469e+09  56.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%

--

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-21 Thread Mark Adams
Here is one with 2M / GPU. Getting better.

On Fri, Jan 21, 2022 at 9:17 PM Barry Smith  wrote:

>
>Matt is correct, vectors are way too small.
>
>BTW: Now would be a good time to run some of the Report I benchmarks on
> Crusher to get a feel for the kernel launch times and performance on VecOps.
>
>Also Report 2.
>
>   Barry
>
>
> On Jan 21, 2022, at 7:58 PM, Matthew Knepley  wrote:
>
> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams  wrote:
>
>> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian
>> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
>> MI200?).
>> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI
>> are similar (mat-vec is a little faster w/o, the total is about the same,
>> call it noise)
>>
>> I found that MatMult was about 3x faster using 8 cores/GPU, that is all
>> 64 cores on the node, than when using 1 core/GPU. With the same size
>> problem of course.
>> I was thinking MatMult should be faster with just one MPI process. Oh
>> well, worry about that later.
>>
>> The bigger problem, and I have observed this to some extent with the
>> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are
>> expensive or crazy expensive.
>> You can see (attached) and the times here that the solve is dominated by
>> not-mat-vec:
>>
>>
>> 
>> EventCount  Time (sec) Flop
>>--- Global ---  --- Stage   *Total   GPU *   - CpuToGpu -   -
>> GpuToCpu - GPU
>>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen
>>  Reduct  %T %F %M %L %R  %T %F %M %L %R *Mflop/s Mflop/s* Count   Size
>> Count   Size  %F
>>
>> ---
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
>> grep "MatMult  400" jac_out_00*5_8_gpuawaremp*
>> MatMult  400 1.0 *1.2507e+00* 1.3 1.34e+10 1.1 3.7e+05
>> 1.6e+04 0.0e+00  1 55 62 54  0  27 91100100  0 *668874   0*  0
>> 0.00e+000 0.00e+00 100
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
>> grep "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
>> KSPSolve   2 1.0 *4.4173e+00* 1.0 1.48e+10 1.1 3.7e+05
>> 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 *208923   1094405*  0
>> 0.00e+000 0.00e+00 100
>>
>> Notes about flop counters here,
>> * that MatMult flops are not logged as GPU flops but something is logged
>> nonetheless.
>> * The GPU flop rate is 5x the total flop rate  in KSPSolve :\
>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we
>> are at < 1%.
>>
>
> This looks complicated, so just a single remark:
>
> My understanding of the benchmarking of vector ops led by Hannah was that
> you needed to be much
> bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs
> I would think you would be
> at 10% of peak or something right off the bat at these sizes. Barry, is
> that right?
>
>   Thanks,
>
>  Matt
>
>
>> Anyway, not sure how to proceed, but I thought I would share.
>> Maybe ask the Kokkos guys if they have looked at Crusher.
>>
>> Mark
>>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> 
>
>
>
DM Object: box 64 MPI processes
  type: plex
box in 3 dimensions:
  Number of 0-cells per rank: 274625 274625 274625 274625 274625 274625 274625 
274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 
274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 
274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 
274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 
274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 
274625 274625
  Number of 1-cells per rank: 811200 811200 811200 811200 811200 811200 811200 
811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 
811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 
811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 
811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 
811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 
811200 811200
  Number of 2-cells per rank: 798720 798720 798720 798720 798720 798720 798720 
798720 798720 798720 798720 798720 798720 798720 798720 798720 798720 798720 
798720 798720 798720 798720 798720 798720 798720 798720 798720 798720 798720 
798720 798720 798720 798720 798720 798720 798720 798720 798720 

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-21 Thread Barry Smith

   Matt is correct, vectors are way too small.

   BTW: Now would be a good time to run some of the Report I benchmarks on 
Crusher to get a feel for the kernel launch times and performance on VecOps.

   Also Report 2.
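
   As a quick stand-in until the actual Report I drivers are run, here is a minimal sketch (my own, not the benchmark itself; the vector size and iteration count are arbitrary, and it assumes a GPU vector type such as -vec_type kokkos is selected at runtime) that estimates the per-call overhead of a small VecAXPY:

  /* Rough stand-in sketch: for tiny vectors the per-call cost of VecAXPY is
     dominated by kernel-launch/submission overhead, which is what this times. */
  #include <petsc.h>
  #include <petsctime.h>

  int main(int argc, char **argv)
  {
    Vec            x, y;
    PetscLogDouble t0, t1;
    PetscInt       n = 1000, nits = 10000, i;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
    ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
    ierr = VecSetSizes(x, PETSC_DECIDE, n);CHKERRQ(ierr);
    ierr = VecSetFromOptions(x);CHKERRQ(ierr);           /* e.g. -vec_type kokkos|hip|cuda */
    ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
    ierr = VecSet(x, 1.0);CHKERRQ(ierr);
    ierr = VecSet(y, 2.0);CHKERRQ(ierr);
    ierr = VecAXPY(y, 1.0, x);CHKERRQ(ierr);             /* warm-up: first launch, page-in */
    ierr = PetscTime(&t0);CHKERRQ(ierr);
    for (i = 0; i < nits; i++) {
      ierr = VecAXPY(y, 1.0, x);CHKERRQ(ierr);           /* kernels may complete asynchronously,
                                                            so this measures submission rate */
    }
    ierr = PetscTime(&t1);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "approx time per small VecAXPY: %g us\n",
                       1e6*(t1 - t0)/nits);CHKERRQ(ierr);
    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = VecDestroy(&y);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }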

  Barry


> On Jan 21, 2022, at 7:58 PM, Matthew Knepley  wrote:
> 
> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams  > wrote:
> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) 
> on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it MI200?).
> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI are 
> similar (mat-vec is a little faster w/o, the total is about the same, call it 
> noise)
> 
> I found that MatMult was about 3x faster using 8 cores/GPU, that is all 64 
> cores on the node, than when using 1 core/GPU. With the same size problem of 
> course.
> I was thinking MatMult should be faster with just one MPI process. Oh well, 
> worry about that later.
> 
> The bigger problem, and I have observed this to some extent with the Landau 
> TS/SNES/GPU-solver on the V/A100s, is that the vector operations are 
> expensive or crazy expensive.
> You can see (attached) and the times here that the solve is dominated by 
> not-mat-vec:
> 
> 
> EventCount  Time (sec) Flop   
>--- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - 
> GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   
> Size  %F
> ---
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep 
> "MatMult  400" jac_out_00*5_8_gpuawaremp*
> MatMult  400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 
> 0.0e+00  1 55 62 54  0  27 91100100  0 668874   0  0 0.00e+000 
> 0.00e+00 100
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep 
> "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
> KSPSolve   2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 
> 1.2e+03  4 60 62 54 61 100100100100100 208923   1094405  0 0.00e+000 
> 0.00e+00 100
> 
> Notes about flop counters here, 
> * that MatMult flops are not logged as GPU flops but something is logged 
> nonetheless.
> * The GPU flop rate is 5x the total flop rate  in KSPSolve :\
> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are at 
> < 1%.
> 
> This looks complicated, so just a single remark:
> 
> My understanding of the benchmarking of vector ops led by Hannah was that you 
> needed to be much
> bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs I 
> would think you would be
> at 10% of peak or something right off the bat at these sizes. Barry, is that 
> right?
> 
>   Thanks,
> 
>  Matt
>  
> Anyway, not sure how to proceed, but I thought I would share.
> Maybe ask the Kokkos guys if they have looked at Crusher.
> 
> Mark
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/ 



Re: [petsc-dev] Kokkos/Crusher performance

2022-01-21 Thread Barry Smith

  Junchao, Mark,

 Some of the logging information is nonsensical: MatMult says all flops
are done on the GPU (last column), but the GPU flop rate is zero.

 It looks like MatMult_SeqAIJKokkos() is missing
PetscLogGpuTimeBegin()/End(); in fact, all the operations in aijkok.kokkos.cxx
seem to be missing it. This might explain the crazy 0 GPU flop rate. Can this
be fixed ASAP?

 Regarding VecOps, it sure looks like the kernel launches are killing performance.

   But in particular, look at the VecTDot and VecNorm CPU flop rates
compared to the GPU: they are much lower, which tells me the MPI_Allreduce is
likely also hurting performance there a great deal. It would be good to see a
single-MPI-rank job to compare, to see performance without the MPI overhead.
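
To spell out why those two operations are exposed to MPI while VecAXPY/VecAYPX are not, here is an illustrative sketch, not PETSc's actual implementation (for a GPU vector the local loop below would be a device kernel plus a one-scalar copy back to the host):

  /* Illustrative only: a (transpose) dot product needs a local partial sum
     followed by a global MPI_Allreduce that every rank must wait on;
     VecAXPY/VecAYPX have no communication at all, which is why a single-rank
     run separates the reduction cost from the kernel cost. */
  #include <petsc.h>

  static PetscErrorCode SketchTDot(Vec x, Vec y, PetscScalar *result)
  {
    const PetscScalar *xa, *ya;
    PetscScalar        local = 0.0;
    PetscInt           i, n;
    PetscErrorCode     ierr;

    PetscFunctionBegin;
    ierr = VecGetLocalSize(x, &n);CHKERRQ(ierr);
    ierr = VecGetArrayRead(x, &xa);CHKERRQ(ierr);
    ierr = VecGetArrayRead(y, &ya);CHKERRQ(ierr);
    for (i = 0; i < n; i++) local += xa[i]*ya[i];  /* local partial sum; a device kernel
                                                      in the real GPU path */
    ierr = VecRestoreArrayRead(y, &ya);CHKERRQ(ierr);
    ierr = VecRestoreArrayRead(x, &xa);CHKERRQ(ierr);
    /* the global combine that VecAXPY/VecAYPX never pay for */
    ierr = MPI_Allreduce(&local, result, 1, MPIU_SCALAR, MPIU_SUM,
                         PetscObjectComm((PetscObject)x));CHKERRMPI(ierr);
    PetscFunctionReturn(0);
  }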







> On Jan 21, 2022, at 6:41 PM, Mark Adams  wrote:
> 
> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) 
> on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it MI200?).
> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI are 
> similar (mat-vec is a little faster w/o, the total is about the same, call it 
> noise)
> 
> I found that MatMult was about 3x faster using 8 cores/GPU, that is all 64 
> cores on the node, than when using 1 core/GPU. With the same size problem of 
> course.
> I was thinking MatMult should be faster with just one MPI process. Oh well, 
> worry about that later.
> 
> The bigger problem, and I have observed this to some extent with the Landau 
> TS/SNES/GPU-solver on the V/A100s, is that the vector operations are 
> expensive or crazy expensive.
> You can see (attached) and the times here that the solve is dominated by 
> not-mat-vec:
> 
> 
> EventCount  Time (sec) Flop   
>--- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - 
> GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   
> Size  %F
> ---
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep 
> "MatMult  400" jac_out_00*5_8_gpuawaremp*
> MatMult  400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 
> 0.0e+00  1 55 62 54  0  27 91100100  0 668874   0  0 0.00e+000 
> 0.00e+00 100
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep 
> "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
> KSPSolve   2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 
> 1.2e+03  4 60 62 54 61 100100100100100 208923   1094405  0 0.00e+000 
> 0.00e+00 100
> 
> Notes about flop counters here, 
> * that MatMult flops are not logged as GPU flops but something is logged 
> nonetheless.
> * The GPU flop rate is 5x the total flop rate  in KSPSolve :\
> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are at 
> < 1%.
> 
> Anyway, not sure how to proceed, but I thought I would share.
> Maybe ask the Kokkos guys if they have looked at Crusher.
> 
> Mark
> 
> 
> 



Re: [petsc-dev] Kokkos/Crusher performance

2022-01-21 Thread Matthew Knepley
On Fri, Jan 21, 2022 at 6:41 PM Mark Adams  wrote:

> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian
> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
> MI200?).
> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI
> are similar (mat-vec is a little faster w/o, the total is about the same,
> call it noise)
>
> I found that MatMult was about 3x faster using 8 cores/GPU, that is all 64
> cores on the node, than when using 1 core/GPU. With the same size problem
> of course.
> I was thinking MatMult should be faster with just one MPI process. Oh
> well, worry about that later.
>
> The bigger problem, and I have observed this to some extent with the
> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are
> expensive or crazy expensive.
> You can see (attached) and the times here that the solve is dominated by
> not-mat-vec:
>
>
> 
> EventCount  Time (sec) Flop
>--- Global ---  --- Stage   *Total   GPU *   - CpuToGpu -   -
> GpuToCpu - GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen
>  Reduct  %T %F %M %L %R  %T %F %M %L %R *Mflop/s Mflop/s* Count   Size
> Count   Size  %F
>
> ---
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
> grep "MatMult  400" jac_out_00*5_8_gpuawaremp*
> MatMult  400 1.0 *1.2507e+00* 1.3 1.34e+10 1.1 3.7e+05
> 1.6e+04 0.0e+00  1 55 62 54  0  27 91100100  0 *668874   0*  0
> 0.00e+000 0.00e+00 100
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
> grep "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
> KSPSolve   2 1.0 *4.4173e+00* 1.0 1.48e+10 1.1 3.7e+05
> 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 *208923   1094405*  0
> 0.00e+000 0.00e+00 100
>
> Notes about flop counters here,
> * that MatMult flops are not logged as GPU flops but something is logged
> nonetheless.
> * The GPU flop rate is 5x the total flop rate  in KSPSolve :\
> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are
> at < 1%.
>

This looks complicated, so just a single remark:

My understanding of the benchmarking of vector ops led by Hannah was that
you needed to be much
bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs I
would think you would be
at 10% of peak or something right off the bat at these sizes. Barry, is
that right?

  Thanks,

 Matt


> Anyway, not sure how to proceed, but I thought I would share.
> Maybe ask the Kokkos guys if they have looked at Crusher.
>
> Mark
>
-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


[petsc-dev] Kokkos/Crusher performance

2022-01-21 Thread Mark Adams
I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian
(ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
MI200?).
This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI
are similar (mat-vec is a little faster w/o, the total is about the same,
call it noise)

I found that MatMult was about 3x faster using 8 cores/GPU, that is all 64
cores on the node, than when using 1 core/GPU. With the same size problem
of course.
I was thinking MatMult should be faster with just one MPI process. Oh well,
worry about that later.

The bigger problem, and I have observed this to some extent with the Landau
TS/SNES/GPU-solver on the V/A100s, is that the vector operations are
expensive or crazy expensive.
You can see from the attached output, and from the times here, that the solve
is dominated by the non-mat-vec operations:


Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----   Total    GPU    - CpuToGpu -   - GpuToCpu -   GPU
                        Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
------------------------------------------------------------------------------------------------------------------------------------------------------------------
17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult  400" jac_out_00*5_8_gpuawaremp*
MatMult              400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27 91 100 100  0  668874       0      0 0.00e+00    0 0.00e+00 100
17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
KSPSolve               2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100 100 100 100 100  208923 1094405      0 0.00e+00    0 0.00e+00 100

Notes about the flop counters here:
* MatMult flops are not logged as GPU flops, but something is logged
nonetheless.
* The GPU flop rate is 5x the total flop rate in KSPSolve :\
* I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are
at < 1%.

Anyway, not sure how to proceed, but I thought I would share.
Maybe ask the Kokkos guys if they have looked at Crusher.

Mark
DM Object: box 64 MPI processes
  type: plex
box in 3 dimensions:
  Number of 0-cells per rank: 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937
  Number of 1-cells per rank: 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544
  Number of 2-cells per rank: 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376
  Number of 3-cells per rank: 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768
Labels:
  celltype: 4 strata with value/size (0 (35937), 1 (104544), 4 (101376), 7 
(32768))
  depth: 4 strata with value/size (0 (35937), 1 (104544), 2 (101376), 3 (32768))
  marker: 1 strata with value/size (1 (12474))
  Face Sets: 3 strata with value/size (1 (3969), 3 (3969), 6 (3969))
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 64 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 64 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 64 MPI processes
type:

Re: [petsc-dev] Gitlab workflow discussion with GitLab developers

2022-01-21 Thread Scott Kruger
On 2022-01-20 21:40, Junchao Zhang did write:
> *  Email notification when one is mentioned or added as a reviewer

Like Barry, I get emails on these so I think your notification settings
are off.

> *  Color text in comment box
> *  Click a failed job, run the job with the *updated* branch

I doubt that they will ever allow this because it would get too
complicated, but there are improvements to the workflow that could be made.

Ideal workflow:
 - Automatically detects that this is a resubmit, and runs the last
   failed job first; i.e., if linux-cuda-double fails, run that job
   first, and then rerun the rest of the pipeline if it passes (so that
   we get a clean pipeline for the MR).

Current workflow (from Satish) which works but is a pain:
 - Launch pipeline.  Stop it.  Find job on web page and start it
   manually.  If passes, hit run on pipeline.

Less-ideal-but-improved workflow:
Based on what I've seen the team do with the `pages:` job (which I
learned about this week), this might work?

Add something like this to `.test`:

  only:
  variables:
- $PETSC_RUN_JOB == $TEST_ARCH

So that could then launch a pipeline with:
PETSC_RUN_JOB = arch-ci-linux-cuda
except I'm pretty sure this won't work based on how those `$`'s are
interpreted.  Thoughts, Satish?

Other less-ideal-but-improved workflow:
I tried playing around with setting variables related to tags when you
launch a job; e.g.,
   PETSC_JOB_TAG = gpu:nvidia

where `gpu:nvidia` is a current tag. I also tried to label a job in
other ways, but I couldn't get it to work (the documentation made me think
we could do this). This was a couple of years ago though, and perhaps
they have something like this working now.



> *  Allow one to reorder commits (e.g., the fix up commits generated from
> applying comments) and mark commits that should be fixed up
> *  Easily retarget a branch, e.g., from main to release (currently I have
> to checkout to local machine, do rebase, then push)

This is asking for a git GUI inside GitLab (GitKraken, gitk, lazygit, etc.).
No disagreement, but the workflow issues should take much higher priority IMO.

Scott

 
> --Junchao Zhang
> 
> 
> On Thu, Jan 20, 2022 at 7:05 PM Barry Smith  wrote:
> 
> >
> >   I got asked to go over some of my Gitlab workflow uses next week with
> > some Gitlab developers; they do this to understand how Gitlab is used, how
> > it can be improved etc.
> >
> >   If anyone has ideas on topics I should hit, let me know. I will hit them
> > on the brokenness of appropriate code-owners not being automatically added
> > to reviewers. And support for people outside of the PETSc group to set more
> > things when they make MRs. And being able to easily add non-PETSc folks as
> > reviewers.
> >
> >   Barry
> >
> >

-- 
Scott Kruger
Tech-X Corporation   kru...@txcorp.com
5621 Arapahoe Ave, Suite A   Phone: (720) 466-3196
Boulder, CO 80303Fax:   (303) 448-7756


Re: [petsc-dev] Gitlab workflow discussion with GitLab developers

2022-01-21 Thread jacob.fai
Almost forgot to mention: ask them if they can make popup boxes (when e.g.
choosing a label, reviewers, etc.) dynamically sized, or at the very least
manually resizable. The size is OK for smaller laptop screens, but showing at
most 5 items of a 30+ item list on a larger screen, with no way to make the
box bigger, is a travesty.

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)
Cell: (312) 694-3391

| -Original Message-
| From: jacob@gmail.com 
| Sent: Friday, January 21, 2022 07:34
| To: 'Lawrence Mitchell' ; 'Barry Smith' 
| Cc: 'petsc-dev' 
| Subject: RE: [petsc-dev] Gitlab workflow discussion with GitLab developers
| 
| 1. Allow searching pipeline jobs by name, type, or tag. I want to be able to
| find all "linux-cuda-double-64idx" jobs that ran in the last 24 hours, or all
| jobs that have the tag gpu:nvidia for example. Currently I must manually click
| through the pages and snoop on everyone's pipelines. It would be nice if the
| https://gitlab.com/petsc/petsc/-/jobs page had the same "filter"
| search box that pipelines have.
| 2. The pipelines filter needs to be able to search for "awaiting approval".
| I like to cancel all the zombie jobs that MRs create due to auto-pipelines,
| which so far I have done by manually checking my MRs after pushes.
| 
| Best regards,
| 
| Jacob Faibussowitsch
| (Jacob Fai - booss - oh - vitch)
| Cell: (312) 694-3391
| 
| | -Original Message-
| | From: petsc-dev  On Behalf Of Lawrence Mitchell
| | Sent: Friday, January 21, 2022 04:32
| | To: Barry Smith 
| | Cc: petsc-dev 
| | Subject: Re: [petsc-dev] Gitlab workflow discussion with GitLab
| | developers
| |
| |
| | > On 21 Jan 2022, at 01:05, Barry Smith  wrote:
| | >
| | > I got asked to go over some of my Gitlab workflow uses next week with
| | > some Gitlab developers; they do this to understand how Gitlab is used, how
| | > it can be improved etc.
| | >
| | >  If anyone has ideas on topics I should hit, let me know. I will hit them
| | > on the brokenness of appropriate code-owners not being automatically added
| | > to reviewers. And support for people outside of the PETSc group to set more
| | > things when they make MRs. And being able to easily add non-PETSc folks as
| | > reviewers.
| |
| | At least in my experience reviewing large (by diffstat measures) MRs in
| | the browser (I use Safari) is nigh-on unusable since the web interface
| | grinds to a halt, and has various other bugs.
| |
| | Of course one should aim for small code changes that are easy to review,
| | but often a small code change will produce large changes in the test
| | output files (which have to appear in the MR, or else one merges broken
| | code).
| |
| | Some examples, if I go to one of Jacob's recent cleanup MRs:
| | https://gitlab.com/petsc/petsc/-/merge_requests/4700/diffs
| |
| | Dragging the scroll bar is very laggy (I guess there's some background
| | thread trying to load things from somewhere?).
| |
| | Semi-randomly, if I click to add a comment on a change, the page jumps
| | back to the start and I lose my place.
| |
| | This seems slightly better in Brave (Chromium).
| |
| | I don't know enough about how the web interface/database integration
| | works, but grinding to a halt on a 6000 line diff is unfortunate.
| |
| | Lawrence




Re: [petsc-dev] Gitlab workflow discussion with GitLab developers

2022-01-21 Thread Jed Brown
When applying suggestions, it should offer to "instant fixup" (apply it to some 
prior commit in this branch, but not in any other branches). That instant fixup 
should highlight commits that changed nearby lines.

When you make an inline comment and the author changes those lines of code, it 
now offers to show you "these lines", but in my experience, it's usually the 
wrong lines. I think because it's showing the same line numbers, not the same 
context (and something had changed the number of lines earlier).

BTW, I submitted a request for open source project status not long ago (maybe a 
week) with intent to unlock some project management features (like Epics).

I kinda wish merge requests could be displayed in Boards. Sometimes we use 
draft MRs to track actual bugs (rather than start by making an issue, then 
linking the issue) and it'd be nice to have a board with MRs that have set 
release as a milestone, with columns for their workflow stage.

Their license detector doesn't recognize BSD-2-Clause. It shows "Other" on the 
home page, even when I tried removing the appended disclaimer about 
--download-package.

https://gitlab.com/petsc/petsc/

Barry Smith  writes:

>   I got asked to go over some of my Gitlab workflow uses next week with some 
> Gitlab developers; they do this to understand how Gitlab is used, how it can 
> be improved etc. 
>
>   If anyone has ideas on topics I should hit, let me know. I will hit them on 
> the brokenness of appropriate code-owners not being automatically added to 
> reviewers. And support for people outside of the Petsc group to set more 
> things when they make MRs. And being able to easily add non-PETSc folks as
> reviewers.
>
>   Barry


Re: [petsc-dev] Gitlab workflow discussion with GitLab developers

2022-01-21 Thread jacob.fai
1. Allow searching pipeline jobs by name, type, or tag. I want to be able to
find all "linux-cuda-double-64idx" jobs that ran in the last 24 hours, or
all jobs that have the tag gpu:nvidia for example. Currently I must manually
click through the pages and snoop on everyone's pipelines. It would be nice
if the https://gitlab.com/petsc/petsc/-/jobs page had the same "filter"
search box that pipelines have. 
2. The pipelines filter needs to be able to search for "awaiting approval".
I like to cancel all the zombie jobs that MRs create due to auto-pipelines,
which so far I have done by manually checking my MRs after pushes.

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)
Cell: (312) 694-3391

| -Original Message-
| From: petsc-dev  On Behalf Of Lawrence
| Mitchell
| Sent: Friday, January 21, 2022 04:32
| To: Barry Smith 
| Cc: petsc-dev 
| Subject: Re: [petsc-dev] Gitlab workflow discussion with GitLab developers
| 
| 
| > On 21 Jan 2022, at 01:05, Barry Smith  wrote:
| >
| > I got asked to go over some of my Gitlab workflow uses next week with
| some Gitlab developers; they do this to understand how Gitlab is used, how
| it can be improved etc.
| >
| >  If anyone has ideas on topics I should hit, let me know. I will hit them
| on the brokenness of appropriate code-owners not being automatically added
| to reviewers. And support for people outside of the PETSc group to set more
| things when they make MRs. And being able to easily add non-PETSc folks as
| reviewers.
| 
| At least in my experience reviewing large (by diffstat measures) MRs in the
| browser (I use Safari) is nigh-on unusable since the web interface grinds to
| a halt, and has various other bugs.
| 
| Of course one should aim for small code changes that are easy to review, but
| often a small code change will produce large changes in the test output files
| (which have to appear in the MR, or else one merges broken code).
| 
| Some examples, if I go to one of Jacob's recent cleanup MRs:
| https://gitlab.com/petsc/petsc/-/merge_requests/4700/diffs
| 
| Dragging the scroll bar is very laggy (I guess there's some background thread
| trying to load things from somewhere?).
| 
| Semi-randomly, if I click to add a comment on a change, the page jumps back
| to the start and I lose my place.
| 
| This seems slightly better in Brave (Chromium).
| 
| I don't know enough about how the web interface/database integration
| works, but grinding to a halt on a 6000 line diff is unfortunate.
| 
| Lawrence



Re: [petsc-dev] Gitlab workflow discussion with GitLab developers

2022-01-21 Thread Lawrence Mitchell


> On 21 Jan 2022, at 01:05, Barry Smith  wrote:
> 
> I got asked to go over some of my Gitlab workflow uses next week with some 
> Gitlab developers; they do this to understand how Gitlab is used, how it can 
> be improved etc. 
> 
>  If anyone has ideas on topics I should hit, let me know. I will hit them on 
> the brokenness of appropriate code-owners not being automatically added to 
> reviewers. And support for people outside of the Petsc group to set more 
> things when they make MRs. And being able to easily add non-PETSc folks as
> reviewers.

At least in my experience, reviewing large (by diffstat measures) MRs in the
browser (I use Safari) is nigh-on unusable, since the web interface grinds to a
halt and has various other bugs.

Of course one should aim for small code changes that are easy to review, but 
often a small code change will produce large changes in the test output files 
(which have to appear in the MR, or else one merges broken code).

Some examples, if I go to one of Jacob's recent cleanup MRs: 
https://gitlab.com/petsc/petsc/-/merge_requests/4700/diffs

Dragging the scroll bar is very laggy (I guess there's some background thread 
trying to load things from somewhere?).

Semi-randomly, if I click to add a comment on a change, the page jumps back to 
the start and I lose my place.

This seems slightly better in Brave (Chromium).

I don't know enough about how the web interface/database integration works, but 
grinding to a halt on a 6000 line diff is unfortunate.

Lawrence

Re: [petsc-dev] Gitlab workflow discussion with GitLab developers

2022-01-21 Thread Patrick Sanan
Very much agreed that the biggest sort of friction is dealing with MRs from
forks. I suspect that the reason many of the things we want don't work is
because they would be too dangerous to allow a random, possibly malicious,
user to do. E.g. setting labels seems innocuous enough, but all kinds of
workflows, including automated ones, could be based on them. A more likely
problem in our case is that someone could open an MR with
"workflow::Ready-to-Merge" because they guess that it means that from their
perspective it's ready (when to us it means more than that). It would be
easy for that to get merged before being reviewed.

So in asking about all this, maybe we should make sure that we understand
the privilege levels GitLab offers, as maybe we can address the usual case
that the outside person making an MR is a researcher or engineer that one
of us knows (of) and so has some degree of trust in, so there would be no
huge risk in giving them the ability to change labels etc.

(And my pet peeve is that my "todo list" is still swamped by "X set you as
an approver for Y".)

On Fri, Jan 21, 2022 at 06:53, Barry Smith wrote:

>
>
> On Jan 20, 2022, at 10:40 PM, Junchao Zhang 
> wrote:
>
> *  Email notification when one is mentioned or added as a reviewer
>
>
>Hmm, I get emails on these? I don't get email saying I am a code owner
> for an MR.
>
> *  Color text in comment box
> *  Click a failed job, run the job with the *updated* branch
> *  Allow one to reorder commits (e.g., the fix up commits generated from
> applying comments) and mark commits that should be fixed up
> *  Easily retarget a branch, e.g., from main to release (currently I have
> to checkout to local machine, do rebase, then push)
>
> --Junchao Zhang
>
>
> On Thu, Jan 20, 2022 at 7:05 PM Barry Smith  wrote:
>
>>
>>   I got asked to go over some of my Gitlab workflow uses next week with
>> some Gitlab developers; they do this to understand how Gitlab is used, how
>> it can be improved etc.
>>
>>   If anyone has ideas on topics I should hit, let me know. I will hit
>> them on the brokenness of appropriate code-owners not being automatically
>> added to reviewers. And support for people outside of the PETSc group to
>> set more things when they make MRs. And being able to easily add non-PETSc
>> folks as reviewers.
>>
>>   Barry
>>
>>
>