Re: [petsc-dev] Kokkos/Crusher performance

2022-01-28 Thread Mark Adams
On Wed, Jan 26, 2022 at 2:51 PM Barry Smith  wrote:

>
>   I have added a mini-MR to print out the key so we can see if it is 0 or
> some crazy number. https://gitlab.com/petsc/petsc/-/merge_requests/4766
>

Well, after all of our MRs (Junchao's in particular) I am no longer seeing this
MPI error, so GPU-aware MPI seems to be working.
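
For reference, a minimal sketch of how GPU-aware MPI can be toggled for an A/B
test of a run like this on Crusher. The Cray MPICH environment variable and the
PETSc -use_gpu_aware_mpi option are the typical knobs; the launch geometry and
the remaining ex13 options below are placeholders, not the exact runs from this
thread:

    # with GPU-aware MPI
    export MPICH_GPU_SUPPORT_ENABLED=1
    srun -N2 -n16 ./ex13 -dm_refine 5 <other ex13 options>

    # without GPU-aware MPI (PETSc stages MPI buffers through the host)
    export MPICH_GPU_SUPPORT_ENABLED=0
    srun -N2 -n16 ./ex13 -dm_refine 5 <other ex13 options> -use_gpu_aware_mpi 0

If the failure appears only in the first run, that points at the GPU-aware path
in MPI rather than at the PETSc/Kokkos code.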


Re: [petsc-dev] Kokkos/Crusher performance

2022-01-26 Thread Mark Adams
Valgrind was not useful: just an MPI abort message buried in the output from 128 processes.
Can we merge my MR so I can test your branch?

On Wed, Jan 26, 2022 at 2:51 PM Barry Smith  wrote:

>
>   I have added a mini-MR to print out the key so we can see if it is 0 or
> some crazy number. https://gitlab.com/petsc/petsc/-/merge_requests/4766
>
>   Note that the table data structure is not sent through MPI so if MPI is
> the culprit it is not just that MPI is putting incorrect (or no)
> information in the receive buffer; it is that MPI is seemingly messing up
> other data.
>
> On Jan 26, 2022, at 2:25 PM, Mark Adams  wrote:
>
> I have used valgrind here. I did not run it on this MPI error. I will.
>
> On Wed, Jan 26, 2022 at 10:56 AM Barry Smith  wrote:
>
>>
>>   Any way to run with valgrind (or a HIP variant of valgrind)? It looks
>> like a memory corruption issue and tracking down exactly when the
>> corruption begins is 3/4's of the way to finding the exact cause.
>>
>>   Are the crashes reproducible in the same place with identical runs?
>>
>>
>> On Jan 26, 2022, at 10:46 AM, Mark Adams  wrote:
>>
>> I think it is an MPI bug. It works with GPU aware MPI turned off.
>> I am sure Summit will be fine.
>> We have had users fix this error by switching their MPI.
>>
>> On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang 
>> wrote:
>>
>>> I don't know if this is due to bugs in petsc/kokkos backend.   See if
>>> you can run 6 nodes (48 mpi ranks).  If it fails, then run the same problem
>>> on Summit with 8 nodes to see if it still fails. If yes, it is likely a bug
>>> of our own.
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Wed, Jan 26, 2022 at 8:44 AM Mark Adams  wrote:
>>>
 I am not able to reproduce this with a small problem. 2 nodes or less
 refinement works. This is from the 8 node test, the -dm_refine 5 version.
 I see that it comes from PtAP.
 This is on the fine grid. (I was thinking it could be on a reduced grid
 with idle processors, but no)

 [15]PETSC ERROR: Argument out of range
 [15]PETSC ERROR: Key <= 0
 [15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
 shooting.
 [15]PETSC ERROR: Petsc Development GIT revision:
 v3.16.3-696-g46640c56cb  GIT Date: 2022-01-25 09:20:51 -0500
 [15]PETSC ERROR:
 /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a
 arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
 [15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC
 --with-fc=ftn --with-fortran-bindings=0
 LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib -lmpi_gtl_hsa" --with-debugging=0
 --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O" --FOPTFLAGS=-g
 --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t 00:10:00"
 --with-hip --with-hipc=hipcc --download-hypre --with-hip-arch=gfx90a
 --download-kokkos --download-kokkos-kernels --with-kokkos-kernels-tpl=0
 --download-p4est=1
 --with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4
 PETSC_ARCH=arch-olcf-crusher
 [15]PETSC ERROR: #1 PetscTableFind() at
 /gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
 [15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
 [15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
 [15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
 [15]PETSC ERROR: #5 MatAssemblyEnd() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
 [15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices()
 at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
 [15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:825
 [15]PETSC ERROR: #8 MatProductSymbolic_MPIAIJKokkos() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1167
 [15]PETSC ERROR: #9 MatProductSymbolic() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matproduct.c:825
 [15]PETSC ERROR: #10 MatPtAP() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:9656
 [15]PETSC ERROR: #11 PCGAMGCreateLevel_GAMG() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:87
 [15]PETSC ERROR: #12 PCSetUp_GAMG() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:663
 [15]PETSC ERROR: #13 PCSetUp() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/interface/precon.c:1017
 [15]PETSC ERROR: #14 KSPSetUp() at
 

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-26 Thread Barry Smith

  I have added a mini-MR to print out the key so we can see if it is 0 or some 
crazy number. https://gitlab.com/petsc/petsc/-/merge_requests/4766

  Note that the table data structure is not sent through MPI so if MPI is the 
culprit it is not just that MPI is putting incorrect (or no) information in the 
receive buffer; it is that MPI is seemingly messing up other data.

> On Jan 26, 2022, at 2:25 PM, Mark Adams  wrote:
> 
> I have used valgrind here. I did not run it on this MPI error. I will.
> 
> On Wed, Jan 26, 2022 at 10:56 AM Barry Smith wrote:
> 
>   Any way to run with valgrind (or a HIP variant of valgrind)? It looks like 
> a memory corruption issue and tracking down exactly when the corruption 
> begins is 3/4's of the way to finding the exact cause.
> 
>   Are the crashes reproducible in the same place with identical runs?
> 
> 
>> On Jan 26, 2022, at 10:46 AM, Mark Adams wrote:
>> 
>> I think it is an MPI bug. It works with GPU aware MPI turned off. 
>> I am sure Summit will be fine.
>> We have had users fix this error by switching their MPI.
>> 
>> On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang wrote:
>> I don't know if this is due to bugs in petsc/kokkos backend.   See if you 
>> can run 6 nodes (48 mpi ranks).  If it fails, then run the same problem on 
>> Summit with 8 nodes to see if it still fails. If yes, it is likely a bug of 
>> our own.
>> 
>> --Junchao Zhang
>> 
>> 
>> On Wed, Jan 26, 2022 at 8:44 AM Mark Adams wrote:
>> I am not able to reproduce this with a small problem. 2 nodes or less 
>> refinement works. This is from the 8 node test, the -dm_refine 5 version.
>> I see that it comes from PtAP.
>> This is on the fine grid. (I was thinking it could be on a reduced grid with 
>> idle processors, but no)
>> 
>> [15]PETSC ERROR: Argument out of range
>> [15]PETSC ERROR: Key <= 0
>> [15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
>> [15]PETSC ERROR: Petsc Development GIT revision: v3.16.3-696-g46640c56cb  
>> GIT Date: 2022-01-25 09:20:51 -0500
>> [15]PETSC ERROR: 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a 
>> arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
>> [15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC --with-fc=ftn 
>> --with-fortran-bindings=0 LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib 
>> -lmpi_gtl_hsa" --with-debugging=0 --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O" 
>> --FOPTFLAGS=-g --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t 
>> 00:10:00" --with-hip --with-hipc=hipcc --download-hypre 
>> --with-hip-arch=gfx90a --download-kokkos --download-kokkos-kernels 
>> --with-kokkos-kernels-tpl=0 --download-p4est=1 
>> --with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4
>>  PETSC_ARCH=arch-olcf-crusher
>> [15]PETSC ERROR: #1 PetscTableFind() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
>> [15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
>> [15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
>> [15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
>> [15]PETSC ERROR: #5 MatAssemblyEnd() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
>> [15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
>> [15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:825
>> [15]PETSC ERROR: #8 MatProductSymbolic_MPIAIJKokkos() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1167
>> [15]PETSC ERROR: #9 MatProductSymbolic() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matproduct.c:825
>> [15]PETSC ERROR: #10 MatPtAP() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:9656
>> [15]PETSC ERROR: #11 PCGAMGCreateLevel_GAMG() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:87
>> [15]PETSC ERROR: #12 PCSetUp_GAMG() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:663
>> [15]PETSC ERROR: #13 PCSetUp() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/interface/precon.c:1017
>> [15]PETSC ERROR: #14 KSPSetUp() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:417
>> [15]PETSC ERROR: #15 KSPSolve_Private() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:863
>> 

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-26 Thread Mark Adams
On Wed, Jan 26, 2022 at 2:32 PM Justin Chang  wrote:

> rocgdb requires "-ggdb" in addition to "-g"
>

Ah, OK.


>
> What happens if you lower AMD_LOG_LEVEL to something like 1 or 2? I was
> hoping AMD_LOG_LEVEL could at least give you something like a "stacktrace"
> showing what the last successful HIP/HSA call was. I believe it should also
> show line numbers in the code.
>

I get a stack trace. The failure happens in our code: we cannot find an
index that we received. The error message no longer includes the bad index;
it used to.
We have seen this before with buggy MPIs.


>
> On Wed, Jan 26, 2022 at 1:29 PM Mark Adams  wrote:
>
>>
>>
>> On Wed, Jan 26, 2022 at 1:54 PM Justin Chang  wrote:
>>
>>> Couple suggestions:
>>>
>>> 1. Set the environment variable "export AMD_LOG_LEVEL=3" <- this will
>>> tell you everything that's happening at the HIP level (memcpy's, mallocs,
>>> kernel execution time, etc)
>>>
>>
>> Hmm, my reproducer uses 2 nodes and 128 processes. I don't think I could
>> do much with this flood of data.
>>
>>
>>> 2. Try rocgdb, AFAIK this is the closest "HIP variant of valgrind" that
>>> we officially support.
>>>
>>
>> rocgdb just sat there reading symbols forever. I'll look at your doc.
>> Valgrind seems OK here.
>>
>>
>>> There are some tricks on running this together with mpi, to which you
>>> can just google "mpi with gdb". But you can see how rocgdb works here:
>>> https://www.olcf.ornl.gov/wp-content/uploads/2021/04/rocgdb_hipmath_ornl_2021_v2.pdf
>>>
>>>
>>> On Wed, Jan 26, 2022 at 9:56 AM Barry Smith  wrote:
>>>

   Any way to run with valgrind (or a HIP variant of valgrind)? It looks
 like a memory corruption issue and tracking down exactly when the
 corruption begins is 3/4's of the way to finding the exact cause.

   Are the crashes reproducible in the same place with identical runs?


 On Jan 26, 2022, at 10:46 AM, Mark Adams  wrote:

 I think it is an MPI bug. It works with GPU aware MPI turned off.
 I am sure Summit will be fine.
 We have had users fix this error by switching their MPI.

 On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang 
 wrote:

> I don't know if this is due to bugs in petsc/kokkos backend.   See if
> you can run 6 nodes (48 mpi ranks).  If it fails, then run the same 
> problem
> on Summit with 8 nodes to see if it still fails. If yes, it is likely a 
> bug
> of our own.
>
> --Junchao Zhang
>
>
> On Wed, Jan 26, 2022 at 8:44 AM Mark Adams  wrote:
>
>> I am not able to reproduce this with a small problem. 2 nodes or less
>> refinement works. This is from the 8 node test, the -dm_refine 5 version.
>> I see that it comes from PtAP.
>> This is on the fine grid. (I was thinking it could be on a reduced
>> grid with idle processors, but no)
>>
>> [15]PETSC ERROR: Argument out of range
>> [15]PETSC ERROR: Key <= 0
>> [15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
>> shooting.
>> [15]PETSC ERROR: Petsc Development GIT revision:
>> v3.16.3-696-g46640c56cb  GIT Date: 2022-01-25 09:20:51 -0500
>> [15]PETSC ERROR:
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a
>> arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
>> [15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC
>> --with-fc=ftn --with-fortran-bindings=0
>> LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib -lmpi_gtl_hsa" 
>> --with-debugging=0
>> --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O" --FOPTFLAGS=-g
>> --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t 00:10:00"
>> --with-hip --with-hipc=hipcc --download-hypre --with-hip-arch=gfx90a
>> --download-kokkos --download-kokkos-kernels --with-kokkos-kernels-tpl=0
>> --download-p4est=1
>> --with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4
>> PETSC_ARCH=arch-olcf-crusher
>> [15]PETSC ERROR: #1 PetscTableFind() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
>> [15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
>> [15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
>> [15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
>> [15]PETSC ERROR: #5 MatAssemblyEnd() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
>> [15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices()
>> at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
>> [15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at
>> 

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-26 Thread Mark Adams
>
>
>   Are the crashes reproducible in the same place with identical runs?
>
>
I have not seen my reproducer work; it fails in MatAssemblyEnd with a table
entry not being found. I can't tell if it is the same error every time.


Re: [petsc-dev] Kokkos/Crusher performance

2022-01-26 Thread Justin Chang
rocgdb requires "-ggdb" in addition to "-g"

What happens if you lower AMD_LOG_LEVEL to something like 1 or 2? I was
hoping AMD_LOG_LEVEL could at least give you something like a "stacktrace"
showing what the last successful HIP/HSA call was. I believe it should also
show line numbers in the code.
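
As a rough sketch of how the -ggdb suggestion combines with an MPI job (the
configure flags mirror the style of the options in the traces above; the
attach-to-one-rank trick and the process lookup are illustrative assumptions
and require a shell on the compute node where the failing rank lives):

    # rebuild with full debug line info for rocgdb
    ./configure ... --COPTFLAGS="-g -ggdb -O" --CXXOPTFLAGS="-g -ggdb -O" ...

    # launch as usual, then attach rocgdb to one rank from that node
    srun -N2 -n16 ./ex13 -dm_refine 5 <options> &
    rocgdb -p $(pgrep -u $USER -n ex13)    # newest ex13 process on this node

A common alternative is to make one rank sleep at startup (guarded by an
environment variable) so there is time to attach before the crash.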

On Wed, Jan 26, 2022 at 1:29 PM Mark Adams  wrote:

>
>
> On Wed, Jan 26, 2022 at 1:54 PM Justin Chang  wrote:
>
>> Couple suggestions:
>>
>> 1. Set the environment variable "export AMD_LOG_LEVEL=3" <- this will
>> tell you everything that's happening at the HIP level (memcpy's, mallocs,
>> kernel execution time, etc)
>>
>
> Hmm, my reproducer uses 2 nodes and 128 processes. I don't think I could do
> much with this flood of data.
>
>
>> 2. Try rocgdb, AFAIK this is the closest "HIP variant of valgrind" that
>> we officially support.
>>
>
> rocgdb just sat there reading symbols forever. I'll look at your doc.
> Valgrind seems OK here.
>
>
>> There are some tricks on running this together with mpi, to which you can
>> just google "mpi with gdb". But you can see how rocgdb works here:
>> https://www.olcf.ornl.gov/wp-content/uploads/2021/04/rocgdb_hipmath_ornl_2021_v2.pdf
>>
>>
>> On Wed, Jan 26, 2022 at 9:56 AM Barry Smith  wrote:
>>
>>>
>>>   Any way to run with valgrind (or a HIP variant of valgrind)? It looks
>>> like a memory corruption issue and tracking down exactly when the
>>> corruption begins is 3/4's of the way to finding the exact cause.
>>>
>>>   Are the crashes reproducible in the same place with identical runs?
>>>
>>>
>>> On Jan 26, 2022, at 10:46 AM, Mark Adams  wrote:
>>>
>>> I think it is an MPI bug. It works with GPU aware MPI turned off.
>>> I am sure Summit will be fine.
>>> We have had users fix this error by switching their MPI.
>>>
>>> On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang 
>>> wrote:
>>>
 I don't know if this is due to bugs in petsc/kokkos backend.   See if
 you can run 6 nodes (48 mpi ranks).  If it fails, then run the same problem
 on Summit with 8 nodes to see if it still fails. If yes, it is likely a bug
 of our own.

 --Junchao Zhang


 On Wed, Jan 26, 2022 at 8:44 AM Mark Adams  wrote:

> I am not able to reproduce this with a small problem. 2 nodes or less
> refinement works. This is from the 8 node test, the -dm_refine 5 version.
> I see that it comes from PtAP.
> This is on the fine grid. (I was thinking it could be on a reduced
> grid with idle processors, but no)
>
> [15]PETSC ERROR: Argument out of range
> [15]PETSC ERROR: Key <= 0
> [15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
> shooting.
> [15]PETSC ERROR: Petsc Development GIT revision:
> v3.16.3-696-g46640c56cb  GIT Date: 2022-01-25 09:20:51 -0500
> [15]PETSC ERROR:
> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a
> arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
> [15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC
> --with-fc=ftn --with-fortran-bindings=0
> LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib -lmpi_gtl_hsa" 
> --with-debugging=0
> --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O" --FOPTFLAGS=-g
> --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t 00:10:00"
> --with-hip --with-hipc=hipcc --download-hypre --with-hip-arch=gfx90a
> --download-kokkos --download-kokkos-kernels --with-kokkos-kernels-tpl=0
> --download-p4est=1
> --with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4
> PETSC_ARCH=arch-olcf-crusher
> [15]PETSC ERROR: #1 PetscTableFind() at
> /gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
> [15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
> [15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
> [15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
> [15]PETSC ERROR: #5 MatAssemblyEnd() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
> [15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices()
> at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
> [15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:825
> [15]PETSC ERROR: #8 MatProductSymbolic_MPIAIJKokkos() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1167
> [15]PETSC ERROR: #9 MatProductSymbolic() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matproduct.c:825
> [15]PETSC ERROR: #10 MatPtAP() 

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-26 Thread Mark Adams
On Wed, Jan 26, 2022 at 1:54 PM Justin Chang  wrote:

> Couple suggestions:
>
> 1. Set the environment variable "export AMD_LOG_LEVEL=3" <- this will tell
> you everything that's happening at the HIP level (memcpy's, mallocs, kernel
> execution time, etc)
>

Hmm, my reproducer uses 2 nodes and 128 processes. I don't think I could do
much with this flood of data.


> 2. Try rocgdb, AFAIK this is the closest "HIP variant of valgrind" that we
> officially support.
>

rocgdb just sat there reading symbols forever. I'll look at your doc.
Valgrind seems OK here.


> There are some tricks on running this together with mpi, to which you can
> just google "mpi with gdb". But you can see how rocgdb works here:
> https://www.olcf.ornl.gov/wp-content/uploads/2021/04/rocgdb_hipmath_ornl_2021_v2.pdf
>
>
> On Wed, Jan 26, 2022 at 9:56 AM Barry Smith  wrote:
>
>>
>>   Any way to run with valgrind (or a HIP variant of valgrind)? It looks
>> like a memory corruption issue and tracking down exactly when the
>> corruption begins is 3/4's of the way to finding the exact cause.
>>
>>   Are the crashes reproducible in the same place with identical runs?
>>
>>
>> On Jan 26, 2022, at 10:46 AM, Mark Adams  wrote:
>>
>> I think it is an MPI bug. It works with GPU aware MPI turned off.
>> I am sure Summit will be fine.
>> We have had users fix this error by switching their MPI.
>>
>> On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang 
>> wrote:
>>
>>> I don't know if this is due to bugs in petsc/kokkos backend.   See if
>>> you can run 6 nodes (48 mpi ranks).  If it fails, then run the same problem
>>> on Summit with 8 nodes to see if it still fails. If yes, it is likely a bug
>>> of our own.
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Wed, Jan 26, 2022 at 8:44 AM Mark Adams  wrote:
>>>
 I am not able to reproduce this with a small problem. 2 nodes or less
 refinement works. This is from the 8 node test, the -dm_refine 5 version.
 I see that it comes from PtAP.
 This is on the fine grid. (I was thinking it could be on a reduced grid
 with idle processors, but no)

 [15]PETSC ERROR: Argument out of range
 [15]PETSC ERROR: Key <= 0
 [15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
 shooting.
 [15]PETSC ERROR: Petsc Development GIT revision:
 v3.16.3-696-g46640c56cb  GIT Date: 2022-01-25 09:20:51 -0500
 [15]PETSC ERROR:
 /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a
 arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
 [15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC
 --with-fc=ftn --with-fortran-bindings=0
 LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib -lmpi_gtl_hsa" --with-debugging=0
 --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O" --FOPTFLAGS=-g
 --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t 00:10:00"
 --with-hip --with-hipc=hipcc --download-hypre --with-hip-arch=gfx90a
 --download-kokkos --download-kokkos-kernels --with-kokkos-kernels-tpl=0
 --download-p4est=1
 --with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4
 PETSC_ARCH=arch-olcf-crusher
 [15]PETSC ERROR: #1 PetscTableFind() at
 /gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
 [15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
 [15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
 [15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
 [15]PETSC ERROR: #5 MatAssemblyEnd() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
 [15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices()
 at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
 [15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:825
 [15]PETSC ERROR: #8 MatProductSymbolic_MPIAIJKokkos() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1167
 [15]PETSC ERROR: #9 MatProductSymbolic() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matproduct.c:825
 [15]PETSC ERROR: #10 MatPtAP() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:9656
 [15]PETSC ERROR: #11 PCGAMGCreateLevel_GAMG() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:87
 [15]PETSC ERROR: #12 PCSetUp_GAMG() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:663
 [15]PETSC ERROR: #13 PCSetUp() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/interface/precon.c:1017
 [15]PETSC 

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-26 Thread Mark Adams
On Wed, Jan 26, 2022 at 2:25 PM Mark Adams  wrote:

> I have used valgrind here. I did not run it on this MPI error. I will.
>
> On Wed, Jan 26, 2022 at 10:56 AM Barry Smith  wrote:
>
>>
>>   Any way to run with valgrind (or a HIP variant of valgrind)? It looks
>> like a memory corruption issue and tracking down exactly when the
>> corruption begins is 3/4's of the way to finding the exact cause.
>>
>>   Are the crashes reproducible in the same place with identical runs?
>>
>>
>> On Jan 26, 2022, at 10:46 AM, Mark Adams  wrote:
>>
>> I think it is an MPI bug. It works with GPU aware MPI turned off.
>> I am sure Summit will be fine.
>> We have had users fix this error by switching their MPI.
>>
>> On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang 
>> wrote:
>>
>>> I don't know if this is due to bugs in petsc/kokkos backend.   See if
>>> you can run 6 nodes (48 mpi ranks).  If it fails, then run the same problem
>>> on Summit with 8 nodes to see if it still fails. If yes, it is likely a bug
>>> of our own.
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Wed, Jan 26, 2022 at 8:44 AM Mark Adams  wrote:
>>>
 I am not able to reproduce this with a small problem. 2 nodes or less
 refinement works. This is from the 8 node test, the -dm_refine 5 version.
 I see that it comes from PtAP.
 This is on the fine grid. (I was thinking it could be on a reduced grid
 with idle processors, but no)

 [15]PETSC ERROR: Argument out of range
 [15]PETSC ERROR: Key <= 0
 [15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
 shooting.
 [15]PETSC ERROR: Petsc Development GIT revision:
 v3.16.3-696-g46640c56cb  GIT Date: 2022-01-25 09:20:51 -0500
 [15]PETSC ERROR:
 /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a
 arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
 [15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC
 --with-fc=ftn --with-fortran-bindings=0
 LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib -lmpi_gtl_hsa" --with-debugging=0
 --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O" --FOPTFLAGS=-g
 --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t 00:10:00"
 --with-hip --with-hipc=hipcc --download-hypre --with-hip-arch=gfx90a
 --download-kokkos --download-kokkos-kernels --with-kokkos-kernels-tpl=0
 --download-p4est=1
 --with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4
 PETSC_ARCH=arch-olcf-crusher
 [15]PETSC ERROR: #1 PetscTableFind() at
 /gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
 [15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
 [15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
 [15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
 [15]PETSC ERROR: #5 MatAssemblyEnd() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
 [15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices()
 at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
 [15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:825
 [15]PETSC ERROR: #8 MatProductSymbolic_MPIAIJKokkos() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1167
 [15]PETSC ERROR: #9 MatProductSymbolic() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matproduct.c:825
 [15]PETSC ERROR: #10 MatPtAP() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:9656
 [15]PETSC ERROR: #11 PCGAMGCreateLevel_GAMG() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:87
 [15]PETSC ERROR: #12 PCSetUp_GAMG() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:663
 [15]PETSC ERROR: #13 PCSetUp() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/interface/precon.c:1017
 [15]PETSC ERROR: #14 KSPSetUp() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:417
 [15]PETSC ERROR: #15 KSPSolve_Private() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:863
 [15]PETSC ERROR: #16 KSPSolve() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:1103
 [15]PETSC ERROR: #17 SNESSolve_KSPONLY() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/impls/ksponly/ksponly.c:51
 [15]PETSC ERROR: #18 SNESSolve() at
 /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:4810
 [15]PETSC ERROR: #19 main() at ex13.c:169
 [15]PETSC 

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-26 Thread Mark Adams
I have used valgrind here. I did not run it on this MPI error. I will.

On Wed, Jan 26, 2022 at 10:56 AM Barry Smith  wrote:

>
>   Any way to run with valgrind (or a HIP variant of valgrind)? It looks
> like a memory corruption issue and tracking down exactly when the
> corruption begins is 3/4's of the way to finding the exact cause.
>
>   Are the crashes reproducible in the same place with identical runs?
>
>
> On Jan 26, 2022, at 10:46 AM, Mark Adams  wrote:
>
> I think it is an MPI bug. It works with GPU aware MPI turned off.
> I am sure Summit will be fine.
> We have had users fix this error by switching their MPI.
>
> On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang 
> wrote:
>
>> I don't know if this is due to bugs in petsc/kokkos backend.   See if you
>> can run 6 nodes (48 mpi ranks).  If it fails, then run the same problem on
>> Summit with 8 nodes to see if it still fails. If yes, it is likely a bug of
>> our own.
>>
>> --Junchao Zhang
>>
>>
>> On Wed, Jan 26, 2022 at 8:44 AM Mark Adams  wrote:
>>
>>> I am not able to reproduce this with a small problem. 2 nodes or less
>>> refinement works. This is from the 8 node test, the -dm_refine 5 version.
>>> I see that it comes from PtAP.
>>> This is on the fine grid. (I was thinking it could be on a reduced grid
>>> with idle processors, but no)
>>>
>>> [15]PETSC ERROR: Argument out of range
>>> [15]PETSC ERROR: Key <= 0
>>> [15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
>>> shooting.
>>> [15]PETSC ERROR: Petsc Development GIT revision: v3.16.3-696-g46640c56cb
>>>  GIT Date: 2022-01-25 09:20:51 -0500
>>> [15]PETSC ERROR:
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a
>>> arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
>>> [15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC
>>> --with-fc=ftn --with-fortran-bindings=0
>>> LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib -lmpi_gtl_hsa" --with-debugging=0
>>> --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O" --FOPTFLAGS=-g
>>> --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t 00:10:00"
>>> --with-hip --with-hipc=hipcc --download-hypre --with-hip-arch=gfx90a
>>> --download-kokkos --download-kokkos-kernels --with-kokkos-kernels-tpl=0
>>> --download-p4est=1
>>> --with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4
>>> PETSC_ARCH=arch-olcf-crusher
>>> [15]PETSC ERROR: #1 PetscTableFind() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
>>> [15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
>>> [15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
>>> [15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
>>> [15]PETSC ERROR: #5 MatAssemblyEnd() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
>>> [15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
>>> [15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:825
>>> [15]PETSC ERROR: #8 MatProductSymbolic_MPIAIJKokkos() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1167
>>> [15]PETSC ERROR: #9 MatProductSymbolic() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matproduct.c:825
>>> [15]PETSC ERROR: #10 MatPtAP() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:9656
>>> [15]PETSC ERROR: #11 PCGAMGCreateLevel_GAMG() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:87
>>> [15]PETSC ERROR: #12 PCSetUp_GAMG() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:663
>>> [15]PETSC ERROR: #13 PCSetUp() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/interface/precon.c:1017
>>> [15]PETSC ERROR: #14 KSPSetUp() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:417
>>> [15]PETSC ERROR: #15 KSPSolve_Private() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:863
>>> [15]PETSC ERROR: #16 KSPSolve() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:1103
>>> [15]PETSC ERROR: #17 SNESSolve_KSPONLY() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/impls/ksponly/ksponly.c:51
>>> [15]PETSC ERROR: #18 SNESSolve() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:4810
>>> [15]PETSC ERROR: #19 main() at ex13.c:169
>>> [15]PETSC ERROR: PETSc Option Table entries:
>>> [15]PETSC ERROR: -benchmark_it 10
>>>
>>> On Wed, Jan 26, 2022 at 7:26 AM Mark Adams  wrote:
>>>
 The GPU aware MPI 

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-26 Thread Justin Chang
Couple suggestions:

1. Set the environment variable "export AMD_LOG_LEVEL=3" <- this will tell
you everything that's happening at the HIP level (memcpy's, mallocs, kernel
execution time, etc)
2. Try rocgdb, AFAIK this is the closest "HIP variant of valgrind" that we
officially support. There are some tricks on running this together with
mpi, to which you can just google "mpi with gdb". But you can see how
rocgdb works here:
https://www.olcf.ornl.gov/wp-content/uploads/2021/04/rocgdb_hipmath_ornl_2021_v2.pdf
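
As a sketch of how to keep the AMD_LOG_LEVEL output manageable across many
ranks (assuming the HIP log goes to stderr; the per-rank redirection wrapper
and the launch geometry are illustrative, not from this thread):

    # level 1 or 2 keeps the volume down; write one HIP log per rank
    export AMD_LOG_LEVEL=2
    srun -N2 -n16 bash -c './ex13 -dm_refine 5 <options> 2> hip.rank${SLURM_PROCID}.log'

Then only the log for the failing rank (rank 15 in the trace below) needs to be
read back against the stack trace.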


On Wed, Jan 26, 2022 at 9:56 AM Barry Smith  wrote:

>
>   Any way to run with valgrind (or a HIP variant of valgrind)? It looks
> like a memory corruption issue and tracking down exactly when the
> corruption begins is 3/4's of the way to finding the exact cause.
>
>   Are the crashes reproducible in the same place with identical runs?
>
>
> On Jan 26, 2022, at 10:46 AM, Mark Adams  wrote:
>
> I think it is an MPI bug. It works with GPU aware MPI turned off.
> I am sure Summit will be fine.
> We have had users fix this error by switching their MPI.
>
> On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang 
> wrote:
>
>> I don't know if this is due to bugs in petsc/kokkos backend.   See if you
>> can run 6 nodes (48 mpi ranks).  If it fails, then run the same problem on
>> Summit with 8 nodes to see if it still fails. If yes, it is likely a bug of
>> our own.
>>
>> --Junchao Zhang
>>
>>
>> On Wed, Jan 26, 2022 at 8:44 AM Mark Adams  wrote:
>>
>>> I am not able to reproduce this with a small problem. 2 nodes or less
>>> refinement works. This is from the 8 node test, the -dm_refine 5 version.
>>> I see that it comes from PtAP.
>>> This is on the fine grid. (I was thinking it could be on a reduced grid
>>> with idle processors, but no)
>>>
>>> [15]PETSC ERROR: Argument out of range
>>> [15]PETSC ERROR: Key <= 0
>>> [15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
>>> shooting.
>>> [15]PETSC ERROR: Petsc Development GIT revision: v3.16.3-696-g46640c56cb
>>>  GIT Date: 2022-01-25 09:20:51 -0500
>>> [15]PETSC ERROR:
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a
>>> arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
>>> [15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC
>>> --with-fc=ftn --with-fortran-bindings=0
>>> LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib -lmpi_gtl_hsa" --with-debugging=0
>>> --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O" --FOPTFLAGS=-g
>>> --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t 00:10:00"
>>> --with-hip --with-hipc=hipcc --download-hypre --with-hip-arch=gfx90a
>>> --download-kokkos --download-kokkos-kernels --with-kokkos-kernels-tpl=0
>>> --download-p4est=1
>>> --with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4
>>> PETSC_ARCH=arch-olcf-crusher
>>> [15]PETSC ERROR: #1 PetscTableFind() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
>>> [15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
>>> [15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
>>> [15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
>>> [15]PETSC ERROR: #5 MatAssemblyEnd() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
>>> [15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
>>> [15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:825
>>> [15]PETSC ERROR: #8 MatProductSymbolic_MPIAIJKokkos() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1167
>>> [15]PETSC ERROR: #9 MatProductSymbolic() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matproduct.c:825
>>> [15]PETSC ERROR: #10 MatPtAP() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:9656
>>> [15]PETSC ERROR: #11 PCGAMGCreateLevel_GAMG() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:87
>>> [15]PETSC ERROR: #12 PCSetUp_GAMG() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:663
>>> [15]PETSC ERROR: #13 PCSetUp() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/interface/precon.c:1017
>>> [15]PETSC ERROR: #14 KSPSetUp() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:417
>>> [15]PETSC ERROR: #15 KSPSolve_Private() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:863
>>> [15]PETSC ERROR: #16 KSPSolve() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:1103
>>> [15]PETSC 

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-26 Thread Barry Smith

  Any way to run with valgrind (or a HIP variant of valgrind)? It looks like a 
memory corruption issue and tracking down exactly when the corruption begins is 
3/4's of the way to finding the exact cause.

  Are the crashes reproducible in the same place with identical runs?
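
A minimal sketch of running every rank under valgrind in this kind of batch job
(the srun geometry and the ex13 options are placeholders; only the valgrind
flags themselves are standard):

    # one valgrind log per rank, named by PID
    srun -N8 -n64 valgrind --track-origins=yes --log-file=vg.%p.log \
         ./ex13 -dm_refine 5 <other options>

Host-side corruption should show up in those logs; device-side corruption will
not, which is where a HIP-aware tool such as rocgdb comes in.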


> On Jan 26, 2022, at 10:46 AM, Mark Adams  wrote:
> 
> I think it is an MPI bug. It works with GPU aware MPI turned off. 
> I am sure Summit will be fine.
> We have had users fix this error by switching their MPI.
> 
> On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang wrote:
> I don't know if this is due to bugs in petsc/kokkos backend.   See if you can 
> run 6 nodes (48 mpi ranks).  If it fails, then run the same problem on Summit 
> with 8 nodes to see if it still fails. If yes, it is likely a bug of our own.
> 
> --Junchao Zhang
> 
> 
> On Wed, Jan 26, 2022 at 8:44 AM Mark Adams wrote:
> I am not able to reproduce this with a small problem. 2 nodes or less 
> refinement works. This is from the 8 node test, the -dm_refine 5 version.
> I see that it comes from PtAP.
> This is on the fine grid. (I was thinking it could be on a reduced grid with 
> idle processors, but no)
> 
> [15]PETSC ERROR: Argument out of range
> [15]PETSC ERROR: Key <= 0
> [15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
> [15]PETSC ERROR: Petsc Development GIT revision: v3.16.3-696-g46640c56cb  GIT 
> Date: 2022-01-25 09:20:51 -0500
> [15]PETSC ERROR: 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a 
> arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
> [15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC --with-fc=ftn 
> --with-fortran-bindings=0 LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib 
> -lmpi_gtl_hsa" --with-debugging=0 --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O" 
> --FOPTFLAGS=-g --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t 
> 00:10:00" --with-hip --with-hipc=hipcc --download-hypre 
> --with-hip-arch=gfx90a --download-kokkos --download-kokkos-kernels 
> --with-kokkos-kernels-tpl=0 --download-p4est=1 
> --with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4
>  PETSC_ARCH=arch-olcf-crusher
> [15]PETSC ERROR: #1 PetscTableFind() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
> [15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
> [15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
> [15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
> [15]PETSC ERROR: #5 MatAssemblyEnd() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
> [15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
> [15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:825
> [15]PETSC ERROR: #8 MatProductSymbolic_MPIAIJKokkos() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1167
> [15]PETSC ERROR: #9 MatProductSymbolic() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matproduct.c:825
> [15]PETSC ERROR: #10 MatPtAP() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:9656
> [15]PETSC ERROR: #11 PCGAMGCreateLevel_GAMG() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:87
> [15]PETSC ERROR: #12 PCSetUp_GAMG() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:663
> [15]PETSC ERROR: #13 PCSetUp() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/interface/precon.c:1017
> [15]PETSC ERROR: #14 KSPSetUp() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:417
> [15]PETSC ERROR: #15 KSPSolve_Private() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:863
> [15]PETSC ERROR: #16 KSPSolve() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:1103
> [15]PETSC ERROR: #17 SNESSolve_KSPONLY() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/impls/ksponly/ksponly.c:51
> [15]PETSC ERROR: #18 SNESSolve() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:4810
> [15]PETSC ERROR: #19 main() at ex13.c:169
> [15]PETSC ERROR: PETSc Option Table entries:
> [15]PETSC ERROR: -benchmark_it 10
> 
> On Wed, Jan 26, 2022 at 7:26 AM Mark Adams wrote:
> The GPU aware MPI is dying going 1 to 8 nodes, 8 processes per node.
> I will make a minimum reproducer. start with 2 nodes, one process on each 
> 

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-26 Thread Mark Adams
I think it is an MPI bug: it works with GPU-aware MPI turned off.
I am sure Summit will be fine.
We have had users fix this error by switching their MPI.

On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang 
wrote:

> I don't know if this is due to bugs in petsc/kokkos backend.   See if you
> can run 6 nodes (48 mpi ranks).  If it fails, then run the same problem on
> Summit with 8 nodes to see if it still fails. If yes, it is likely a bug of
> our own.
>
> --Junchao Zhang
>
>
> On Wed, Jan 26, 2022 at 8:44 AM Mark Adams  wrote:
>
>> I am not able to reproduce this with a small problem. 2 nodes or less
>> refinement works. This is from the 8 node test, the -dm_refine 5 version.
>> I see that it comes from PtAP.
>> This is on the fine grid. (I was thinking it could be on a reduced grid
>> with idle processors, but no)
>>
>> [15]PETSC ERROR: Argument out of range
>> [15]PETSC ERROR: Key <= 0
>> [15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
>> [15]PETSC ERROR: Petsc Development GIT revision: v3.16.3-696-g46640c56cb
>>  GIT Date: 2022-01-25 09:20:51 -0500
>> [15]PETSC ERROR:
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a
>> arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
>> [15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC
>> --with-fc=ftn --with-fortran-bindings=0
>> LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib -lmpi_gtl_hsa" --with-debugging=0
>> --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O" --FOPTFLAGS=-g
>> --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t 00:10:00"
>> --with-hip --with-hipc=hipcc --download-hypre --with-hip-arch=gfx90a
>> --download-kokkos --download-kokkos-kernels --with-kokkos-kernels-tpl=0
>> --download-p4est=1
>> --with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4
>> PETSC_ARCH=arch-olcf-crusher
>> [15]PETSC ERROR: #1 PetscTableFind() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
>> [15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
>> [15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
>> [15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
>> [15]PETSC ERROR: #5 MatAssemblyEnd() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
>> [15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
>> [15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:825
>> [15]PETSC ERROR: #8 MatProductSymbolic_MPIAIJKokkos() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1167
>> [15]PETSC ERROR: #9 MatProductSymbolic() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matproduct.c:825
>> [15]PETSC ERROR: #10 MatPtAP() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:9656
>> [15]PETSC ERROR: #11 PCGAMGCreateLevel_GAMG() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:87
>> [15]PETSC ERROR: #12 PCSetUp_GAMG() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:663
>> [15]PETSC ERROR: #13 PCSetUp() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/interface/precon.c:1017
>> [15]PETSC ERROR: #14 KSPSetUp() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:417
>> [15]PETSC ERROR: #15 KSPSolve_Private() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:863
>> [15]PETSC ERROR: #16 KSPSolve() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:1103
>> [15]PETSC ERROR: #17 SNESSolve_KSPONLY() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/impls/ksponly/ksponly.c:51
>> [15]PETSC ERROR: #18 SNESSolve() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:4810
>> [15]PETSC ERROR: #19 main() at ex13.c:169
>> [15]PETSC ERROR: PETSc Option Table entries:
>> [15]PETSC ERROR: -benchmark_it 10
>>
>> On Wed, Jan 26, 2022 at 7:26 AM Mark Adams  wrote:
>>
>>> The GPU aware MPI is dying going 1 to 8 nodes, 8 processes per node.
>>> I will make a minimum reproducer. start with 2 nodes, one process on
>>> each node.
>>>
>>>
>>> On Tue, Jan 25, 2022 at 10:19 PM Barry Smith  wrote:
>>>

   So the MPI is killing you in going from 8 to 64. (The GPU flop rate
 scales almost perfectly, but the overall flop rate is only half of what it
 should be at 64).

 On Jan 25, 2022, at 9:24 PM, Mark Adams  wrote:

 It looks like we have our instrumentation and job configuration in
 decent shape so on to scaling 

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-26 Thread Junchao Zhang
I don't know if this is due to bugs in the petsc/kokkos backend. See if you
can run on 6 nodes (48 MPI ranks). If it fails, then run the same problem on
Summit with 8 nodes to see if it still fails. If it does, it is likely a bug
of our own.

--Junchao Zhang


On Wed, Jan 26, 2022 at 8:44 AM Mark Adams  wrote:

> I am not able to reproduce this with a small problem. 2 nodes or less
> refinement works. This is from the 8 node test, the -dm_refine 5 version.
> I see that it comes from PtAP.
> This is on the fine grid. (I was thinking it could be on a reduced grid
> with idle processors, but no)
>
> [15]PETSC ERROR: Argument out of range
> [15]PETSC ERROR: Key <= 0
> [15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
> [15]PETSC ERROR: Petsc Development GIT revision: v3.16.3-696-g46640c56cb
>  GIT Date: 2022-01-25 09:20:51 -0500
> [15]PETSC ERROR:
> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a
> arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
> [15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC
> --with-fc=ftn --with-fortran-bindings=0
> LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib -lmpi_gtl_hsa" --with-debugging=0
> --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O" --FOPTFLAGS=-g
> --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t 00:10:00"
> --with-hip --with-hipc=hipcc --download-hypre --with-hip-arch=gfx90a
> --download-kokkos --download-kokkos-kernels --with-kokkos-kernels-tpl=0
> --download-p4est=1
> --with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4
> PETSC_ARCH=arch-olcf-crusher
> [15]PETSC ERROR: #1 PetscTableFind() at
> /gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
> [15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
> [15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
> [15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
> [15]PETSC ERROR: #5 MatAssemblyEnd() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
> [15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
> [15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:825
> [15]PETSC ERROR: #8 MatProductSymbolic_MPIAIJKokkos() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1167
> [15]PETSC ERROR: #9 MatProductSymbolic() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matproduct.c:825
> [15]PETSC ERROR: #10 MatPtAP() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:9656
> [15]PETSC ERROR: #11 PCGAMGCreateLevel_GAMG() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:87
> [15]PETSC ERROR: #12 PCSetUp_GAMG() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:663
> [15]PETSC ERROR: #13 PCSetUp() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/interface/precon.c:1017
> [15]PETSC ERROR: #14 KSPSetUp() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:417
> [15]PETSC ERROR: #15 KSPSolve_Private() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:863
> [15]PETSC ERROR: #16 KSPSolve() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:1103
> [15]PETSC ERROR: #17 SNESSolve_KSPONLY() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/impls/ksponly/ksponly.c:51
> [15]PETSC ERROR: #18 SNESSolve() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:4810
> [15]PETSC ERROR: #19 main() at ex13.c:169
> [15]PETSC ERROR: PETSc Option Table entries:
> [15]PETSC ERROR: -benchmark_it 10
>
> On Wed, Jan 26, 2022 at 7:26 AM Mark Adams  wrote:
>
>> The GPU aware MPI is dying going 1 to 8 nodes, 8 processes per node.
>> I will make a minimum reproducer. start with 2 nodes, one process on each
>> node.
>>
>>
>> On Tue, Jan 25, 2022 at 10:19 PM Barry Smith  wrote:
>>
>>>
>>>   So the MPI is killing you in going from 8 to 64. (The GPU flop rate
>>> scales almost perfectly, but the overall flop rate is only half of what it
>>> should be at 64).
>>>
>>> On Jan 25, 2022, at 9:24 PM, Mark Adams  wrote:
>>>
>>> It looks like we have our instrumentation and job configuration in
>>> decent shape so on to scaling with AMG.
>>> In using multiple nodes I got errors with table entries not found, which
>>> can be caused by a buggy MPI, and the problem does go away when I turn GPU
>>> aware MPI off.
>>> Jed's analysis, if I have this right, is that at *0.7T* flops we are at
>>> about 35% of theoretical peak wrt memory 

Re: [petsc-dev] Kokkos/Crusher performance

2022-01-26 Thread Mark Adams
I am not able to reproduce this with a small problem: with 2 nodes, or with
less refinement, it works. This is from the 8 node test, the -dm_refine 5 version.
I see that it comes from PtAP.
This is on the fine grid. (I was thinking it could be on a reduced grid
with idle processors, but no.)

[15]PETSC ERROR: Argument out of range
[15]PETSC ERROR: Key <= 0
[15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[15]PETSC ERROR: Petsc Development GIT revision: v3.16.3-696-g46640c56cb
 GIT Date: 2022-01-25 09:20:51 -0500
[15]PETSC ERROR:
/gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a
arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
[15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC --with-fc=ftn
--with-fortran-bindings=0 LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib
-lmpi_gtl_hsa" --with-debugging=0 --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O"
--FOPTFLAGS=-g --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t
00:10:00" --with-hip --with-hipc=hipcc --download-hypre
--with-hip-arch=gfx90a --download-kokkos --download-kokkos-kernels
--with-kokkos-kernels-tpl=0 --download-p4est=1
--with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4
PETSC_ARCH=arch-olcf-crusher
[15]PETSC ERROR: #1 PetscTableFind() at
/gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
[15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
[15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
[15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
[15]PETSC ERROR: #5 MatAssemblyEnd() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
[15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
[15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:825
[15]PETSC ERROR: #8 MatProductSymbolic_MPIAIJKokkos() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1167
[15]PETSC ERROR: #9 MatProductSymbolic() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matproduct.c:825
[15]PETSC ERROR: #10 MatPtAP() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:9656
[15]PETSC ERROR: #11 PCGAMGCreateLevel_GAMG() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:87
[15]PETSC ERROR: #12 PCSetUp_GAMG() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:663
[15]PETSC ERROR: #13 PCSetUp() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/interface/precon.c:1017
[15]PETSC ERROR: #14 KSPSetUp() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:417
[15]PETSC ERROR: #15 KSPSolve_Private() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:863
[15]PETSC ERROR: #16 KSPSolve() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:1103
[15]PETSC ERROR: #17 SNESSolve_KSPONLY() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/snes/impls/ksponly/ksponly.c:51
[15]PETSC ERROR: #18 SNESSolve() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:4810
[15]PETSC ERROR: #19 main() at ex13.c:169
[15]PETSC ERROR: PETSc Option Table entries:
[15]PETSC ERROR: -benchmark_it 10
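
For reference on what "Key <= 0" means: PetscTableFind() is the small hash table that
MPIAIJ assembly uses to map nonlocal global column indices to local positions; as I
understand the convention, keys are global indices shifted by +1 and must be positive,
so a key of 0 or a negative value means a corrupted or uninitialized column index
reached assembly, which is why a buggy GPU-aware MPI is a plausible culprit. A minimal
sketch of the key convention (illustrative only, not code from this thread):

#include <petscctable.h>

int main(int argc, char **argv)
{
  PetscErrorCode ierr;
  PetscTable     ta;
  PetscInt       col = 41, data = 0;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = PetscTableCreate(10, 100, &ta);CHKERRQ(ierr);                    /* keys must lie in [1, maxkey] */
  ierr = PetscTableAdd(ta, col + 1, 7 + 1, INSERT_VALUES);CHKERRQ(ierr);  /* store local id 7, shifted so 0 can mean "absent" */
  ierr = PetscTableFind(ta, col + 1, &data);CHKERRQ(ierr);                /* a key <= 0 here triggers exactly the error above */
  ierr = PetscPrintf(PETSC_COMM_SELF, "global col %d -> local %d\n", (int)col, (int)(data - 1));CHKERRQ(ierr);
  ierr = PetscTableDestroy(&ta);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}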

On Wed, Jan 26, 2022 at 7:26 AM Mark Adams  wrote:

> The GPU aware MPI is dying going 1 to 8 nodes, 8 processes per node.
> I will make a minimum reproducer. start with 2 nodes, one process on each
> node.
>
>
> On Tue, Jan 25, 2022 at 10:19 PM Barry Smith  wrote:
>
>>
>>   So the MPI is killing you in going from 8 to 64. (The GPU flop rate
>> scales almost perfectly, but the overall flop rate is only half of what it
>> should be at 64).
>>
>> On Jan 25, 2022, at 9:24 PM, Mark Adams  wrote:
>>
>> It looks like we have our instrumentation and job configuration in decent
>> shape so on to scaling with AMG.
>> In using multiple nodes I got errors with table entries not found, which
>> can be caused by a buggy MPI, and the problem does go away when I turn GPU
>> aware MPI off.
>> Jed's analysis, if I have this right, is that at *0.7T* flops we are at
>> about 35% of theoretical peak wrt memory bandwidth.
>> I run out of memory with the next step in this study (7 levels of
>> refinement), with 2M equations per GPU. This seems low to me and we will
>> see if we can fix this.
>> So this 0.7Tflops is with only 1/4 M equations so 35% is not terrible.
>> Here are the solve times with 001, 008 and 064 nodes, and 5 or 6 levels
>> of refinement.
>>
>> out_001_kokkos_Crusher_5_1.txt:KSPSolve  10 1.0 1.2933e+00
>> 1.0 4.13e+10 1.1 1.8e+05 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Mark Adams
The GPU aware MPI is dying going from 1 to 8 nodes, 8 processes per node.
I will make a minimum reproducer, starting with 2 nodes and one process on each
node.
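
A minimal sketch of the kind of standalone check meant here (illustrative only, not
the actual reproducer; the buffer size and names are made up): each of two ranks fills
a device buffer and hands the device pointer straight to MPI, which is the GPU-aware
path under suspicion. Run it with one rank per node; if it aborts or receives wrong
values with GPU-aware support enabled but passes with it disabled, the MPI is the
problem.

#include <mpi.h>
#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK_HIP(cmd) do { hipError_t e_ = (cmd); if (e_ != hipSuccess) { \
    fprintf(stderr, "HIP error: %s\n", hipGetErrorString(e_)); MPI_Abort(MPI_COMM_WORLD, 1); } } while (0)

int main(int argc, char **argv)
{
  const int n = 1 << 20;                /* 1M doubles per message, arbitrary */
  int       rank, size, i, ok = 1;
  double   *hbuf, *dsend, *drecv;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  if (size != 2) { if (!rank) fprintf(stderr, "run with exactly 2 ranks\n"); MPI_Abort(MPI_COMM_WORLD, 1); }

  hbuf = (double*)malloc(n * sizeof(double));
  for (i = 0; i < n; i++) hbuf[i] = (double)rank;
  CHECK_HIP(hipMalloc((void**)&dsend, n * sizeof(double)));
  CHECK_HIP(hipMalloc((void**)&drecv, n * sizeof(double)));
  CHECK_HIP(hipMemcpy(dsend, hbuf, n * sizeof(double), hipMemcpyHostToDevice));

  /* device pointers passed directly to MPI: the GPU-aware path under test */
  MPI_Sendrecv(dsend, n, MPI_DOUBLE, 1 - rank, 0,
               drecv, n, MPI_DOUBLE, 1 - rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  CHECK_HIP(hipMemcpy(hbuf, drecv, n * sizeof(double), hipMemcpyDeviceToHost));
  for (i = 0; i < n; i++) if (hbuf[i] != (double)(1 - rank)) { ok = 0; break; }
  printf("[%d] GPU-aware exchange: %s\n", rank, ok ? "ok" : "WRONG DATA");

  CHECK_HIP(hipFree(dsend)); CHECK_HIP(hipFree(drecv)); free(hbuf);
  MPI_Finalize();
  return ok ? 0 : 1;
}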


On Tue, Jan 25, 2022 at 10:19 PM Barry Smith  wrote:

>
>   So the MPI is killing you in going from 8 to 64. (The GPU flop rate
> scales almost perfectly, but the overall flop rate is only half of what it
> should be at 64).
>
> On Jan 25, 2022, at 9:24 PM, Mark Adams  wrote:
>
> It looks like we have our instrumentation and job configuration in decent
> shape so on to scaling with AMG.
> In using multiple nodes I got errors with table entries not found, which
> can be caused by a buggy MPI, and the problem does go away when I turn GPU
> aware MPI off.
> Jed's analysis, if I have this right, is that at *0.7T* flops we are at
> about 35% of theoretical peak wrt memory bandwidth.
> I run out of memory with the next step in this study (7 levels of
> refinement), with 2M equations per GPU. This seems low to me and we will
> see if we can fix this.
> So this 0.7Tflops is with only 1/4 M equations so 35% is not terrible.
> Here are the solve times with 001, 008 and 064 nodes, and 5 or 6 levels of
> refinement.
>
> out_001_kokkos_Crusher_5_1.txt:KSPSolve  10 1.0 1.2933e+00 1.0
> 4.13e+10 1.1 1.8e+05 8.4e+03 5.8e+02  3 87 86 78 48 100100100100100 248792
>   423857   6840 3.85e+02 6792 3.85e+02 100
> out_001_kokkos_Crusher_6_1.txt:KSPSolve  10 1.0 5.3667e+00 1.0
> 3.89e+11 1.0 2.1e+05 3.3e+04 6.7e+02  2 87 86 79 48 100100100100100 571572
>   *72*   7920 1.74e+03 7920 1.74e+03 100
> out_008_kokkos_Crusher_5_1.txt:KSPSolve  10 1.0 1.9407e+00 1.0
> 4.94e+10 1.1 3.5e+06 6.2e+03 6.7e+02  5 87 86 79 47 100100100100100 1581096
>   3034723   7920 6.88e+02 7920 6.88e+02 100
> out_008_kokkos_Crusher_6_1.txt:KSPSolve  10 1.0 7.4478e+00 1.0
> 4.49e+11 1.0 4.1e+06 2.3e+04 7.6e+02  2 88 87 80 49 100100100100100 3798162
>   5557106   9367 3.02e+03 9359 3.02e+03 100
> out_064_kokkos_Crusher_5_1.txt:KSPSolve  10 1.0 2.4551e+00 1.0
> 5.40e+10 1.1 4.2e+07 5.4e+03 7.3e+02  5 88 87 80 47 100100100100100
> 11065887   23792978   8684 8.90e+02 8683 8.90e+02 100
> out_064_kokkos_Crusher_6_1.txt:KSPSolve  10 1.0 1.1335e+01 1.0
> 5.38e+11 1.0 5.4e+07 2.0e+04 9.1e+02  4 88 88 82 49 100100100100100
> 24130606   43326249   11249 4.26e+03 11249 4.26e+03 100
>
> On Tue, Jan 25, 2022 at 1:49 PM Mark Adams  wrote:
>
>>
>>> Note that Mark's logs have been switching back and forth between
>>> -use_gpu_aware_mpi and changing number of ranks -- we won't have that
>>> information if we do manual timing hacks. This is going to be a routine
>>> thing we'll need on the mailing list and we need the provenance to go with
>>> it.
>>>
>>
>> GPU aware MPI crashes sometimes so to be safe, while debugging, I had it
>> off. It works fine here so it has been on in the last tests.
>> Here is a comparison.
>>
>>
> 
>
>
>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Barry Smith

  So the MPI is killing you in going from 8 to 64. (The GPU flop rate scales 
almost perfectly, but the overall flop rate is only half of what it should be 
at 64).

> On Jan 25, 2022, at 9:24 PM, Mark Adams  wrote:
> 
> It looks like we have our instrumentation and job configuration in decent 
> shape so on to scaling with AMG.
> In using multiple nodes I got errors with table entries not found, which can 
> be caused by a buggy MPI, and the problem does go away when I turn GPU aware 
> MPI off.
> Jed's analysis, if I have this right, is that at 0.7T flops we are at about 
> 35% of theoretical peak wrt memory bandwidth.
> I run out of memory with the next step in this study (7 levels of 
> refinement), with 2M equations per GPU. This seems low to me and we will see 
> if we can fix this.
> So this 0.7Tflops is with only 1/4 M equations so 35% is not terrible.
> Here are the solve times with 001, 008 and 064 nodes, and 5 or 6 levels of 
> refinement.
> 
> out_001_kokkos_Crusher_5_1.txt:KSPSolve  10 1.0 1.2933e+00 1.0 
> 4.13e+10 1.1 1.8e+05 8.4e+03 5.8e+02  3 87 86 78 48 100100100100100 248792   
> 423857   6840 3.85e+02 6792 3.85e+02 100
> out_001_kokkos_Crusher_6_1.txt:KSPSolve  10 1.0 5.3667e+00 1.0 
> 3.89e+11 1.0 2.1e+05 3.3e+04 6.7e+02  2 87 86 79 48 100100100100100 571572   
> 72   7920 1.74e+03 7920 1.74e+03 100
> out_008_kokkos_Crusher_5_1.txt:KSPSolve  10 1.0 1.9407e+00 1.0 
> 4.94e+10 1.1 3.5e+06 6.2e+03 6.7e+02  5 87 86 79 47 100100100100100 1581096   
> 3034723   7920 6.88e+02 7920 6.88e+02 100
> out_008_kokkos_Crusher_6_1.txt:KSPSolve  10 1.0 7.4478e+00 1.0 
> 4.49e+11 1.0 4.1e+06 2.3e+04 7.6e+02  2 88 87 80 49 100100100100100 3798162   
> 5557106   9367 3.02e+03 9359 3.02e+03 100
> out_064_kokkos_Crusher_5_1.txt:KSPSolve  10 1.0 2.4551e+00 1.0 
> 5.40e+10 1.1 4.2e+07 5.4e+03 7.3e+02  5 88 87 80 47 100100100100100 11065887  
>  23792978   8684 8.90e+02 8683 8.90e+02 100
> out_064_kokkos_Crusher_6_1.txt:KSPSolve  10 1.0 1.1335e+01 1.0 
> 5.38e+11 1.0 5.4e+07 2.0e+04 9.1e+02  4 88 88 82 49 100100100100100 24130606  
>  43326249   11249 4.26e+03 11249 4.26e+03 100
> 
> On Tue, Jan 25, 2022 at 1:49 PM Mark Adams  > wrote:
> 
> Note that Mark's logs have been switching back and forth between 
> -use_gpu_aware_mpi and changing number of ranks -- we won't have that 
> information if we do manual timing hacks. This is going to be a routine thing 
> we'll need on the mailing list and we need the provenance to go with it.
> 
> GPU aware MPI crashes sometimes so to be safe, while debugging, I had it off. 
> It works fine here so it has been on in the last tests.
> Here is a comparison.
>  
> 



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Mark Adams
>
>
> Note that Mark's logs have been switching back and forth between
> -use_gpu_aware_mpi and changing number of ranks -- we won't have that
> information if we do manual timing hacks. This is going to be a routine
> thing we'll need on the mailing list and we need the provenance to go with
> it.
>

GPU aware MPI crashes sometimes so to be safe, while debugging, I had it
off. It works fine here so it has been on in the last tests.
Here is a comparison.
Script started on 2022-01-25 13:44:31-05:00 [TERM="xterm-256color" 
TTY="/dev/pts/0" COLUMNS="296" LINES="100"]
13:44 adams/aijkokkos-gpu-logging *= 
crusher:/gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
 bash -x run_crusher_jac.sbatch
+ '[' -z '' ']'
+ case "$-" in
+ __lmod_vx=x
+ '[' -n x ']'
+ set +x
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for this output 
(/usr/share/lmod/lmod/init/bash)
Shell debugging restarted
+ unset __lmod_vx
+ NG=8
+ NC=1
+ date
Tue 25 Jan 2022 01:44:38 PM EST
+ EXTRA='-dm_view -log_viewx -ksp_view -use_gpu_aware_mpi true'
+ HYPRE_EXTRA='-pc_hypre_boomeramg_relax_type_all l1scaled-Jacobi 
-pc_hypre_boomeramg_interp_type ext+i -pc_hypre_boomeramg_coarsen_type PMIS 
-pc_hypre_boomeramg_no_CF'
+ HYPRE_EXTRA='-pc_hypre_boomeramg_no_CF true 
-pc_hypre_boomeramg_strong_threshold 0.75 -pc_hypre_boomeramg_agg_nl 1 
-pc_hypre_boomeramg_coarsen_type HMIS -pc_hypre_boomeramg_interp_type ext+i '
+ for REFINE in 5
+ for NPIDX in 1
+ let 'N1 = 1 * 1'
++ bc -l
+ PG=2.
++ printf %.0f 2.
+ PG=2
+ let 'NCC = 8 / 1'
+ let 'N4 = 2 * 1'
+ let 'NODES = 1 * 1 * 1'
+ let 'N = 1 * 1 * 8'
+ echo n= 8 ' NODES=' 1 ' NC=' 1 ' PG=' 2
n= 8  NODES= 1  NC= 1  PG= 2
++ printf %03d 1
+ foo=001
+ srun -n8 -N1 --ntasks-per-gpu=1 --gpu-bind=closest -c 8 ../ex13 
-dm_plex_box_faces 2,2,2 -petscpartitioner_simple_process_grid 2,2,2 
-dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1 -dm_refine 5 
-dm_view -log_viewx -ksp_view -use_gpu_aware_mpi true -dm_mat_type aijkokkos 
-dm_vec_type kokkos -pc_type jacobi
+ tee jac_out_001_kokkos_Crusher_5_1_noview.txt
DM Object: box 8 MPI processes
  type: plex
box in 3 dimensions:
  Number of 0-cells per rank: 35937 35937 35937 35937 35937 35937 35937 35937
  Number of 1-cells per rank: 104544 104544 104544 104544 104544 104544 104544 
104544
  Number of 2-cells per rank: 101376 101376 101376 101376 101376 101376 101376 
101376
  Number of 3-cells per rank: 32768 32768 32768 32768 32768 32768 32768 32768
Labels:
  celltype: 4 strata with value/size (0 (35937), 1 (104544), 4 (101376), 7 
(32768))
  depth: 4 strata with value/size (0 (35937), 1 (104544), 2 (101376), 3 (32768))
  marker: 1 strata with value/size (1 (12474))
  Face Sets: 3 strata with value/size (1 (3969), 3 (3969), 6 (3969))
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=2048383, cols=2048383
total: nonzeros=127263527, allocated nonzeros=127263527
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=2048383, cols=2048383
total: nonzeros=127263527, allocated nonzeros=127263527
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=2048383, cols=2048383
total: nonzeros=127263527, allocated nonzeros=127263527
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
Solve time: 0.34211
#PETSc Option Table entries:
-benchmark_it 2
-dm_distribute
-dm_mat_type aijkokkos
-dm_plex_box_faces 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Mark Adams
Here are two runs, without and with -log_view, respectively.
My new timer is "Solve time = "
About 10% difference

On Tue, Jan 25, 2022 at 12:53 PM Mark Adams  wrote:

> BTW, a -device_view would be great.
>
> On Tue, Jan 25, 2022 at 12:30 PM Mark Adams  wrote:
>
>>
>>
>> On Tue, Jan 25, 2022 at 11:56 AM Jed Brown  wrote:
>>
>>> Barry Smith  writes:
>>>
>>> >   Thanks Mark, far more interesting. I've improved the formatting to
>>> make it easier to read (and fixed width font for email reading)
>>> >
>>> >   * Can you do same run with say 10 iterations of Jacobi PC?
>>> >
>>> >   * PCApply performance (looks like GAMG) is terrible! Problems too
>>> small?
>>>
>>> This is -pc_type jacobi.
>>>
>>> >   * VecScatter time is completely dominated by SFPack! Junchao what's
>>> up with that? Lots of little kernels in the PCApply? PCJACOBI run will help
>>> clarify where that is coming from.
>>>
>>> It's all in MatMult.
>>>
>>> I'd like to see a run that doesn't wait for the GPU.
>>>
>>>
>> Not sure what you mean. Can I do that?
>>
>>
>
Script started on 2022-01-25 13:33:45-05:00 [TERM="xterm-256color" 
TTY="/dev/pts/0" COLUMNS="296" LINES="100"]
13:33 adams/aijkokkos-gpu-logging *= 
crusher:/gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
 bash -x run_crusher_jac.sbatch
+ '[' -z '' ']'
+ case "$-" in
+ __lmod_vx=x
+ '[' -n x ']'
+ set +x
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for this output 
(/usr/share/lmod/lmod/init/bash)
Shell debugging restarted
+ unset __lmod_vx
+ NG=8
+ NC=1
+ date
Tue 25 Jan 2022 01:33:53 PM EST
+ EXTRA='-dm_view -log_viewx -ksp_view -use_gpu_aware_mpi true'
+ HYPRE_EXTRA='-pc_hypre_boomeramg_relax_type_all l1scaled-Jacobi 
-pc_hypre_boomeramg_interp_type ext+i -pc_hypre_boomeramg_coarsen_type PMIS 
-pc_hypre_boomeramg_no_CF'
+ HYPRE_EXTRA='-pc_hypre_boomeramg_no_CF true 
-pc_hypre_boomeramg_strong_threshold 0.75 -pc_hypre_boomeramg_agg_nl 1 
-pc_hypre_boomeramg_coarsen_type HMIS -pc_hypre_boomeramg_interp_type ext+i '
+ for REFINE in 5
+ for NPIDX in 1
+ let 'N1 = 1 * 1'
++ bc -l
+ PG=2.
++ printf %.0f 2.
+ PG=2
+ let 'NCC = 8 / 1'
+ let 'N4 = 2 * 1'
+ let 'NODES = 1 * 1 * 1'
+ let 'N = 1 * 1 * 8'
+ echo n= 8 ' NODES=' 1 ' NC=' 1 ' PG=' 2
n= 8  NODES= 1  NC= 1  PG= 2
++ printf %03d 1
+ foo=001
+ srun -n8 -N1 --ntasks-per-gpu=1 --gpu-bind=closest -c 8 ../ex13 
-dm_plex_box_faces 2,2,2 -petscpartitioner_simple_process_grid 2,2,2 
-dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1 -dm_refine 5 
-dm_view -log_viewx -ksp_view -use_gpu_aware_mpi true -dm_mat_type aijkokkos 
-dm_vec_type kokkos -pc_type jacobi
+ tee jac_out_001_kokkos_Crusher_5_1_noview.txt
DM Object: box 8 MPI processes
  type: plex
box in 3 dimensions:
  Number of 0-cells per rank: 35937 35937 35937 35937 35937 35937 35937 35937
  Number of 1-cells per rank: 104544 104544 104544 104544 104544 104544 104544 
104544
  Number of 2-cells per rank: 101376 101376 101376 101376 101376 101376 101376 
101376
  Number of 3-cells per rank: 32768 32768 32768 32768 32768 32768 32768 32768
Labels:
  celltype: 4 strata with value/size (0 (35937), 1 (104544), 4 (101376), 7 
(32768))
  depth: 4 strata with value/size (0 (35937), 1 (104544), 2 (101376), 3 (32768))
  marker: 1 strata with value/size (1 (12474))
  Face Sets: 3 strata with value/size (1 (3969), 3 (3969), 6 (3969))
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=2048383, cols=2048383
total: nonzeros=127263527, allocated nonzeros=127263527
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=2048383, cols=2048383
total: nonzeros=127263527, allocated nonzeros=127263527
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Jed Brown
Barry Smith  writes:

>> What is the command line option to turn 
>> PetscLogGpuTimeBegin/PetscLogGpuTimeEnd into a no-op even when -log_view is 
>> on? I know it'll mess up attribution, but it'll still tell us how long the 
>> solve took.
>
>   We don't have an API for this yet. It is slightly tricky because turning it 
> off will also break the regular -log_view for some stuff like VecAXPY(); 
> anything that doesn't have a needed synchronization with the CPU.

Of course it will misattribute time, but the high-level (like KSPSolve) is 
still useful. We need an option for this so we can still have -log_view output.

Note that Mark's logs have been switching back and forth between 
-use_gpu_aware_mpi and changing number of ranks -- we won't have that 
information if we do manual timing hacks. This is going to be a routine thing 
we'll need on the mailing list and we need the provenance to go with it.


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Mark Adams
BTW, a -device_view would be great.

On Tue, Jan 25, 2022 at 12:30 PM Mark Adams  wrote:

>
>
> On Tue, Jan 25, 2022 at 11:56 AM Jed Brown  wrote:
>
>> Barry Smith  writes:
>>
>> >   Thanks Mark, far more interesting. I've improved the formatting to
>> make it easier to read (and fixed width font for email reading)
>> >
>> >   * Can you do same run with say 10 iterations of Jacobi PC?
>> >
>> >   * PCApply performance (looks like GAMG) is terrible! Problems too
>> small?
>>
>> This is -pc_type jacobi.
>>
>> >   * VecScatter time is completely dominated by SFPack! Junchao what's
>> up with that? Lots of little kernels in the PCApply? PCJACOBI run will help
>> clarify where that is coming from.
>>
>> It's all in MatMult.
>>
>> I'd like to see a run that doesn't wait for the GPU.
>>
>>
> Not sure what you mean. Can I do that?
>
>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Barry Smith



> On Jan 25, 2022, at 12:25 PM, Jed Brown  wrote:
> 
> Barry Smith  writes:
> 
>>> On Jan 25, 2022, at 11:55 AM, Jed Brown  wrote:
>>> 
>>> Barry Smith  writes:
>>> 
 Thanks Mark, far more interesting. I've improved the formatting to make it 
 easier to read (and fixed width font for email reading)
 
 * Can you do same run with say 10 iterations of Jacobi PC?
 
 * PCApply performance (looks like GAMG) is terrible! Problems too small?
>>> 
>>> This is -pc_type jacobi.
>> 
>>  Dang, how come it doesn't warn about all the gamg arguments passed to the 
>> program? I saw them and jumped to the wrong conclusion.
> 
> We don't have -options_left by default. Mark has a big .petscrc or 
> PETSC_OPTIONS.
> 
>>  How come PCApply is so low while Pointwise mult (which should be all of 
>> PCApply) is high?
> 
> I also think that's weird.
> 
>>> 
 * VecScatter time is completely dominated by SFPack! Junchao what's up 
 with that? Lots of little kernels in the PCApply? PCJACOBI run will help 
 clarify where that is coming from.
>>> 
>>> It's all in MatMult.
>>> 
>>> I'd like to see a run that doesn't wait for the GPU.
>> 
>>  Indeed
> 
> What is the command line option to turn 
> PetscLogGpuTimeBegin/PetscLogGpuTimeEnd into a no-op even when -log_view is 
> on? I know it'll mess up attribution, but it'll still tell us how long the 
> solve took.

  We don't have an API for this yet. It is slightly tricky because turning it 
off will also break the regular -log_view for some stuff like VecAXPY(); 
anything that doesn't have a needed synchronization with the CPU.

  Because of this I think Mark should just put a PetscTime() around KSPSolve in a
run without -log_view, and we can compare that number to the one from -log_view
to see how much overhead the synchronousness of PetscLogGPUTime is causing. Ad hoc,
yes, but a quick easy way to get the information.
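
Concretely, that ad hoc timing is just the following (a sketch; it assumes ierr, ksp,
b, and x are already in scope as in the example's solve path):

  PetscLogDouble t0, t1;

  MPI_Barrier(PETSC_COMM_WORLD);                    /* optional: line the ranks up first */
  ierr = PetscTime(&t0);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = PetscTime(&t1);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "Solve time: %g\n", (double)(t1 - t0));CHKERRQ(ierr);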

> 
> Also, can we make WaitForKokkos a no-op? I don't think it's necessary for 
> correctness (docs indicate kokkos::fence synchronizes).



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Mark Adams
On Tue, Jan 25, 2022 at 11:56 AM Jed Brown  wrote:

> Barry Smith  writes:
>
> >   Thanks Mark, far more interesting. I've improved the formatting to
> make it easier to read (and fixed width font for email reading)
> >
> >   * Can you do same run with say 10 iterations of Jacobi PC?
> >
> >   * PCApply performance (looks like GAMG) is terrible! Problems too
> small?
>
> This is -pc_type jacobi.
>
> >   * VecScatter time is completely dominated by SFPack! Junchao what's up
> with that? Lots of little kernels in the PCApply? PCJACOBI run will help
> clarify where that is coming from.
>
> It's all in MatMult.
>
> I'd like to see a run that doesn't wait for the GPU.
>
>
Not sure what you mean. Can I do that?


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Jed Brown
Barry Smith  writes:

>> On Jan 25, 2022, at 11:55 AM, Jed Brown  wrote:
>> 
>> Barry Smith  writes:
>> 
>>>  Thanks Mark, far more interesting. I've improved the formatting to make it 
>>> easier to read (and fixed width font for email reading)
>>> 
>>>  * Can you do same run with say 10 iterations of Jacobi PC?
>>> 
>>>  * PCApply performance (looks like GAMG) is terrible! Problems too small?
>> 
>> This is -pc_type jacobi.
>
>   Dang, how come it doesn't warn about all the gamg arguments passed to the 
> program? I saw them and jumped to the wrong conclusion.

We don't have -options_left by default. Mark has a big .petscrc or 
PETSC_OPTIONS.

>   How come PCApply is so low while Pointwise mult (which should be all of 
> PCApply) is high?

I also think that's weird.

>> 
>>>  * VecScatter time is completely dominated by SFPack! Junchao what's up 
>>> with that? Lots of little kernels in the PCApply? PCJACOBI run will help 
>>> clarify where that is coming from.
>> 
>> It's all in MatMult.
>> 
>> I'd like to see a run that doesn't wait for the GPU.
>
>   Indeed

What is the command line option to turn PetscLogGpuTimeBegin/PetscLogGpuTimeEnd 
into a no-op even when -log_view is on? I know it'll mess up attribution, but 
it'll still tell us how long the solve took.

Also, can we make WaitForKokkos a no-op? I don't think it's necessary for 
correctness (docs indicate kokkos::fence synchronizes).


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Barry Smith



> On Jan 25, 2022, at 11:55 AM, Jed Brown  wrote:
> 
> Barry Smith  writes:
> 
>>  Thanks Mark, far more interesting. I've improved the formatting to make it 
>> easier to read (and fixed width font for email reading)
>> 
>>  * Can you do same run with say 10 iterations of Jacobi PC?
>> 
>>  * PCApply performance (looks like GAMG) is terrible! Problems too small?
> 
> This is -pc_type jacobi.

  Dang, how come it doesn't warn about all the gamg arguments passed to the 
program? I saw them and jumped to the wrong conclusion.

  How come PCApply is so low while Pointwise mult (which should be all of 
PCApply) is high?

  
> 
>>  * VecScatter time is completely dominated by SFPack! Junchao what's up with 
>> that? Lots of little kernels in the PCApply? PCJACOBI run will help clarify 
>> where that is coming from.
> 
> It's all in MatMult.
> 
> I'd like to see a run that doesn't wait for the GPU.

  Indeed

> 
>> 
>> EventCount  Time (sec) Flop  
>> --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu 
>> - GPU
>>   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count  
>>  Size  %F
>> ---
>> 
>> MatMult  200 1.0 6.7831e-01 1.0 4.91e+10 1.0 1.1e+04 6.6e+04 
>> 1.0e+00  9 92 99 79  0  71 92100100  0 579,635  1,014,212  1 2.04e-04
>> 0 0.00e+00 100
>> KSPSolve   1 1.0 9.4550e-01 1.0 5.31e+10 1.0 1.1e+04 6.6e+04 
>> 6.0e+02 12100 99 79 94 100100100100100 449,667893,741  1 2.04e-04
>> 0 0.00e+00 100
>> PCApply  201 1.0 1.6966e-01 1.0 3.09e+08 1.0 0.0e+00 0.0e+00 
>> 2.0e+00  2  1  0  0  0  18  1  0  0  0  14,55816,3941  0 0.00e+00
>> 0 0.00e+00 100
>> VecTDot  401 1.0 5.3642e-02 1.3 1.23e+09 1.0 0.0e+00 0.0e+00 
>> 4.0e+02  1  2  0  0 62   5  2  0  0 66 183,716353,914  0 0.00e+00
>> 0 0.00e+00 100
>> VecNorm  201 1.0 2.2219e-02 1.1 6.17e+08 1.0 0.0e+00 0.0e+00 
>> 2.0e+02  0  1  0  0 31   2  1  0  0 33 222,325303,155  0 0.00e+00
>> 0 0.00e+00 100
>> VecAXPY  400 1.0 2.3017e-02 1.1 1.23e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  2  0  0  0   2  2  0  0  0 427,091514,744  0 0.00e+00
>> 0 0.00e+00 100
>> VecAYPX  199 1.0 1.1312e-02 1.1 6.11e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  1  0  0  0   1  1  0  0  0 432,323532,889  0 0.00e+00
>> 0 0.00e+00 100
>> VecPointwiseMult 201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  1  0  0  0   1  1  0  0  0 235,882290,088  0 0.00e+00
>> 0 0.00e+00 100
>> VecScatterBegin  200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04 
>> 1.0e+00  2  0 99 79  0  19  0100100  0   0  0  1 2.04e-04
>> 0 0.00e+00  0
>> VecScatterEnd200 1.0 1.9007e-02 3.7 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   1  0  0  0  0   0  0  0 0.00e+00
>> 0 0.00e+00  0
>> SFPack   200 1.0 1.7309e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  0  0  0  0  18  0  0  0  0   0  0  1 2.04e-04
>> 0 0.00e+00  0
>> SFUnpack 200 1.0 2.3165e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0   0  0  0 0.00e+00
>> 0 0.00e+00  0
>> 
>> 
>>> On Jan 25, 2022, at 8:29 AM, Mark Adams  wrote:
>>> 
>>> adding Suyash,
>>> 
>>> I found the/a problem. Using ex56, which has a crappy decomposition, using 
>>> one MPI process/GPU is much faster than using 8 (64 total). (I am looking 
>>> at ex13 to see how much of this is due to the decomposition)
>>> If you only use 8 processes it seems that all 8 are put on the first GPU, 
>>> but adding -c8 seems to fix this.
>>> Now the numbers are looking reasonable.
>>> 
>>> On Mon, Jan 24, 2022 at 3:24 PM Barry Smith >> > wrote:
>>> 
>>>  For this, to start, someone can run 
>>> 
>>> src/vec/vec/tutorials/performance.c 
>>> 
>>> and compare the performance to that in the technical report Evaluation of 
>>> PETSc on a Heterogeneous Architecture \\ the OLCF Summit System \\ Part I: 
>>> Vector Node Performance. Google to find. One does not have to and shouldn't 
>>> do an extensive study right now that compares everything, instead one 
>>> should run a very small number of different size problems (make them big) 
>>> and compare those sizes with what Summit gives. Note you will need to make 
>>> sure that performance.c uses the Kokkos backend.
>>> 
>>>  One hopes for better performance than Summit; if one gets tons worse we 
>>> know something is very wrong somewhere. I'd love to see some comparisons.

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Jed Brown
Barry Smith  writes:

>   Thanks Mark, far more interesting. I've improved the formatting to make it 
> easier to read (and fixed width font for email reading)
>
>   * Can you do same run with say 10 iterations of Jacobi PC?
>
>   * PCApply performance (looks like GAMG) is terrible! Problems too small?

This is -pc_type jacobi.

>   * VecScatter time is completely dominated by SFPack! Junchao what's up with 
> that? Lots of little kernels in the PCApply? PCJACOBI run will help clarify 
> where that is coming from.

It's all in MatMult.

I'd like to see a run that doesn't wait for the GPU.

> 
> EventCount  Time (sec) Flop   
>--- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - 
> GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   
> Size  %F
> ---
>
> MatMult  200 1.0 6.7831e-01 1.0 4.91e+10 1.0 1.1e+04 6.6e+04 
> 1.0e+00  9 92 99 79  0  71 92100100  0 579,635  1,014,212  1 2.04e-04
> 0 0.00e+00 100
> KSPSolve   1 1.0 9.4550e-01 1.0 5.31e+10 1.0 1.1e+04 6.6e+04 
> 6.0e+02 12100 99 79 94 100100100100100 449,667893,741  1 2.04e-04
> 0 0.00e+00 100
> PCApply  201 1.0 1.6966e-01 1.0 3.09e+08 1.0 0.0e+00 0.0e+00 
> 2.0e+00  2  1  0  0  0  18  1  0  0  0  14,55816,3941  0 0.00e+00
> 0 0.00e+00 100
> VecTDot  401 1.0 5.3642e-02 1.3 1.23e+09 1.0 0.0e+00 0.0e+00 
> 4.0e+02  1  2  0  0 62   5  2  0  0 66 183,716353,914  0 0.00e+00
> 0 0.00e+00 100
> VecNorm  201 1.0 2.2219e-02 1.1 6.17e+08 1.0 0.0e+00 0.0e+00 
> 2.0e+02  0  1  0  0 31   2  1  0  0 33 222,325303,155  0 0.00e+00
> 0 0.00e+00 100
> VecAXPY  400 1.0 2.3017e-02 1.1 1.23e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  2  0  0  0   2  2  0  0  0 427,091514,744  0 0.00e+00
> 0 0.00e+00 100
> VecAYPX  199 1.0 1.1312e-02 1.1 6.11e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  1  0  0  0   1  1  0  0  0 432,323532,889  0 0.00e+00
> 0 0.00e+00 100
> VecPointwiseMult 201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  1  0  0  0   1  1  0  0  0 235,882290,088  0 0.00e+00
> 0 0.00e+00 100
> VecScatterBegin  200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04 
> 1.0e+00  2  0 99 79  0  19  0100100  0   0  0  1 2.04e-04
> 0 0.00e+00  0
> VecScatterEnd200 1.0 1.9007e-02 3.7 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   1  0  0  0  0   0  0  0 0.00e+00
> 0 0.00e+00  0
> SFPack   200 1.0 1.7309e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  2  0  0  0  0  18  0  0  0  0   0  0  1 2.04e-04
> 0 0.00e+00  0
> SFUnpack 200 1.0 2.3165e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0   0  0  0 0.00e+00
> 0 0.00e+00  0
>
>
>> On Jan 25, 2022, at 8:29 AM, Mark Adams  wrote:
>> 
>> adding Suyash,
>> 
>> I found the/a problem. Using ex56, which has a crappy decomposition, using 
>> one MPI process/GPU is much faster than using 8 (64 total). (I am looking at 
>> ex13 to see how much of this is due to the decomposition)
>> If you only use 8 processes it seems that all 8 are put on the first GPU, 
>> but adding -c8 seems to fix this.
>> Now the numbers are looking reasonable.
>> 
>> On Mon, Jan 24, 2022 at 3:24 PM Barry Smith > > wrote:
>> 
>>   For this, to start, someone can run 
>> 
>> src/vec/vec/tutorials/performance.c 
>> 
>> and compare the performance to that in the technical report Evaluation of 
>> PETSc on a Heterogeneous Architecture \\ the OLCF Summit System \\ Part I: 
>> Vector Node Performance. Google to find. One does not have to and shouldn't 
>> do an extensive study right now that compares everything, instead one should 
>> run a very small number of different size problems (make them big) and 
>> compare those sizes with what Summit gives. Note you will need to make sure 
>> that performance.c uses the Kokkos backend.
>> 
>>   One hopes for better performance than Summit; if one gets tons worse we 
>> know something is very wrong somewhere. I'd love to see some comparisons.
>> 
>>   Barry
>> 
>> 
>>> On Jan 24, 2022, at 3:06 PM, Justin Chang >> > wrote:
>>> 
>>> Also, do you guys have an OLCF liaison? That's actually your better bet if 
>>> you do. 
>>> 
>>> Performance issues with ROCm/Kokkos are pretty common in apps besides just 
>>> PETSc. We have several teams actively working on rectifying this. However, 
> I think performance issues can be quicker to identify if we had a more
> "official" and reproducible PETSc GPU benchmark.

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Barry Smith
  Thanks Mark, far more interesting. I've improved the formatting to make it 
easier to read (and fixed width font for email reading)

  * Can you do same run with say 10 iterations of Jacobi PC?

  * PCApply performance (looks like GAMG) is terrible! Problems too small?

  * VecScatter time is completely dominated by SFPack! Junchao what's up with 
that? Lots of little kernels in the PCApply? PCJACOBI run will help clarify 
where that is coming from.


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---

MatMult  200 1.0 6.7831e-01 1.0 4.91e+10 1.0 1.1e+04 6.6e+04 
1.0e+00  9 92 99 79  0  71 92100100  0 579,635  1,014,212  1 2.04e-040 
0.00e+00 100
KSPSolve   1 1.0 9.4550e-01 1.0 5.31e+10 1.0 1.1e+04 6.6e+04 
6.0e+02 12100 99 79 94 100100100100100 449,667893,741  1 2.04e-040 
0.00e+00 100
PCApply  201 1.0 1.6966e-01 1.0 3.09e+08 1.0 0.0e+00 0.0e+00 
2.0e+00  2  1  0  0  0  18  1  0  0  0  14,55816,3941  0 0.00e+000 
0.00e+00 100
VecTDot  401 1.0 5.3642e-02 1.3 1.23e+09 1.0 0.0e+00 0.0e+00 
4.0e+02  1  2  0  0 62   5  2  0  0 66 183,716353,914  0 0.00e+000 
0.00e+00 100
VecNorm  201 1.0 2.2219e-02 1.1 6.17e+08 1.0 0.0e+00 0.0e+00 
2.0e+02  0  1  0  0 31   2  1  0  0 33 222,325303,155  0 0.00e+000 
0.00e+00 100
VecAXPY  400 1.0 2.3017e-02 1.1 1.23e+09 1.0 0.0e+00 0.0e+00 
0.0e+00  0  2  0  0  0   2  2  0  0  0 427,091514,744  0 0.00e+000 
0.00e+00 100
VecAYPX  199 1.0 1.1312e-02 1.1 6.11e+08 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   1  1  0  0  0 432,323532,889  0 0.00e+000 
0.00e+00 100
VecPointwiseMult 201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   1  1  0  0  0 235,882290,088  0 0.00e+000 
0.00e+00 100
VecScatterBegin  200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04 
1.0e+00  2  0 99 79  0  19  0100100  0   0  0  1 2.04e-040 
0.00e+00  0
VecScatterEnd200 1.0 1.9007e-02 3.7 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   1  0  0  0  0   0  0  0 0.00e+000 
0.00e+00  0
SFPack   200 1.0 1.7309e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0  18  0  0  0  0   0  0  1 2.04e-040 
0.00e+00  0
SFUnpack 200 1.0 2.3165e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0   0  0  0 0.00e+000 
0.00e+00  0
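
To put a number on the SFPack point, reading the rows above (just arithmetic on the
logged times, nothing more):

  t_{\mathrm{SFPack}} / t_{\mathrm{VecScatterBegin}} = 1.7309\times 10^{-1} / 1.8458\times 10^{-1} \approx 0.94

so roughly 94% of VecScatterBegin, and about 18% of the whole KSPSolve
(0.173 / 0.946), is spent packing.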


> On Jan 25, 2022, at 8:29 AM, Mark Adams  wrote:
> 
> adding Suyash,
> 
> I found the/a problem. Using ex56, which has a crappy decomposition, using 
> one MPI process/GPU is much faster than using 8 (64 total). (I am looking at 
> ex13 to see how much of this is due to the decomposition)
> If you only use 8 processes it seems that all 8 are put on the first GPU, but 
> adding -c8 seems to fix this.
> Now the numbers are looking reasonable.
> 
> On Mon, Jan 24, 2022 at 3:24 PM Barry Smith  > wrote:
> 
>   For this, to start, someone can run 
> 
> src/vec/vec/tutorials/performance.c 
> 
> and compare the performance to that in the technical report Evaluation of 
> PETSc on a Heterogeneous Architecture \\ the OLCF Summit System \\ Part I: 
> Vector Node Performance. Google to find. One does not have to and shouldn't 
> do an extensive study right now that compares everything, instead one should 
> run a very small number of different size problems (make them big) and 
> compare those sizes with what Summit gives. Note you will need to make sure 
> that performance.c uses the Kokkos backend.
> 
>   One hopes for better performance than Summit; if one gets tons worse we 
> know something is very wrong somewhere. I'd love to see some comparisons.
> 
>   Barry
> 
> 
>> On Jan 24, 2022, at 3:06 PM, Justin Chang > > wrote:
>> 
>> Also, do you guys have an OLCF liaison? That's actually your better bet if 
>> you do. 
>> 
>> Performance issues with ROCm/Kokkos are pretty common in apps besides just 
>> PETSc. We have several teams actively working on rectifying this. However, I 
>> think performance issues can be quicker to identify if we had a more 
>> "official" and reproducible PETSc GPU benchmark, which I've already 
>> expressed to some folks in this thread, and as others already commented on 
>> the difficulty of such a task. Hopefully I will have more time soon to illustrate what I am thinking.

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Mark Adams
>
>
>
> > VecPointwiseMult 201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  1  0  0  0   1  1  0  0  0 235882   290088  0 0.00e+000
> 0.00e+00 100
> > VecScatterBegin  200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04
> 1.0e+00  2  0 99 79  0  19  0100100  0 0   0  1 2.04e-040
> 0.00e+00  0
> > VecScatterEnd200 1.0 1.9007e-02 3.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   1  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
>
> I'm curious how these change with problem size. (To what extent are we
> latency vs bandwidth limited?)
>
>
I am getting a segv in ex13 now, a Kokkos view in Plex, but will do scaling
tests when I get it going again.
(trying to get GAMG scaling for Todd by the 3rd)



> > SFSetUp1 1.0 1.3015e-03 1.3 0.00e+00 0.0 1.1e+02 1.7e+04
> 1.0e+00  0  0  1  0  0   0  0  1  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> > SFPack   200 1.0 1.7309e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  2  0  0  0  0  18  0  0  0  0 0   0  1 2.04e-040
> 0.00e+00  0
> > SFUnpack 200 1.0 2.3165e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Jed Brown
Mark Adams  writes:

> adding Suyash,
>
> I found the/a problem. Using ex56, which has a crappy decomposition, using
> one MPI process/GPU is much faster than using 8 (64 total). (I am looking
> at ex13 to see how much of this is due to the decomposition)
> If you only use 8 processes it seems that all 8 are put on the first GPU,
> but adding -c8 seems to fix this.
> Now the numbers are looking reasonable.

Hah, we need -log_view to report the bus ID for each GPU so we don't spend another
day of mailing list traffic identifying it.
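
In the meantime, a per-rank report like the following would settle the binding
question; this is a sketch of a throwaway check, not an existing PETSc option, and
only the HIP runtime calls shown here are assumed:

#include <hip/hip_runtime.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int  rank, dev = -1, ndev = 0;
  char bus[64] = "unknown";

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (hipGetDeviceCount(&ndev) == hipSuccess && ndev > 0) {
    hipGetDevice(&dev);                           /* device this rank is currently bound to */
    hipDeviceGetPCIBusId(bus, sizeof(bus), dev);  /* unique per GCD */
  }
  printf("rank %d: %d visible device(s), using device %d, PCI bus id %s\n", rank, ndev, dev, bus);
  MPI_Finalize();
  return 0;
}

If every rank prints the same bus id, they have all landed on the same GCD.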

This looks to be 2-3x the performance of Spock.

> 
> EventCount  Time (sec) Flop   
>--- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - 
> GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   
> Size  %F
> ---

[...]

> --- Event Stage 2: Solve
>
> BuildTwoSided  1 1.0 9.1706e-05 1.6 0.00e+00 0.0 5.6e+01 4.0e+00 
> 1.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> MatMult  200 1.0 6.7831e-01 1.0 4.91e+10 1.0 1.1e+04 6.6e+04 
> 1.0e+00  9 92 99 79  0  71 92100100  0 579635   1014212  1 2.04e-040 
> 0.00e+00 100

GPU compute bandwidth of around 6 TB/s is okay, but disappointing that 
communication is so expensive.
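
One way to back that figure out of the MatMult row above, assuming AIJ SpMV moves
roughly 12 bytes of matrix data per nonzero (an 8-byte value plus a 4-byte column
index) for 2 flops, i.e. about 1/6 flop per byte, and ignoring vector traffic:

  B \approx (\text{GPU flop rate}) \times 6\,\mathrm{bytes/flop} \approx 1.0\times 10^{12} \times 6 \approx 6\,\mathrm{TB/s}

aggregated over the node's 8 GCDs, i.e. roughly 0.75 TB/s of streaming per GCD.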

> MatView1 1.0 7.8531e-05 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 
> 1.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> KSPSolve   1 1.0 9.4550e-01 1.0 5.31e+10 1.0 1.1e+04 6.6e+04 
> 6.0e+02 12100 99 79 94 100100100100100 449667   893741  1 2.04e-040 
> 0.00e+00 100
> PCApply  201 1.0 1.6966e-01 1.0 3.09e+08 1.0 0.0e+00 0.0e+00 
> 2.0e+00  2  1  0  0  0  18  1  0  0  0 14558   163941  0 0.00e+000 
> 0.00e+00 100
> VecTDot  401 1.0 5.3642e-02 1.3 1.23e+09 1.0 0.0e+00 0.0e+00 
> 4.0e+02  1  2  0  0 62   5  2  0  0 66 183716   353914  0 0.00e+000 
> 0.00e+00 100
> VecNorm  201 1.0 2.2219e-02 1.1 6.17e+08 1.0 0.0e+00 0.0e+00 
> 2.0e+02  0  1  0  0 31   2  1  0  0 33 222325   303155  0 0.00e+000 
> 0.00e+00 100
> VecCopy2 1.0 2.3551e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecSet 1 1.0 9.8740e-05 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecAXPY  400 1.0 2.3017e-02 1.1 1.23e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  2  0  0  0   2  2  0  0  0 427091   514744  0 0.00e+000 
> 0.00e+00 100
> VecAYPX  199 1.0 1.1312e-02 1.1 6.11e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  1  0  0  0   1  1  0  0  0 432323   532889  0 0.00e+000 
> 0.00e+00 100

These two are finally about the same speed, but these numbers imply kernel 
overhead of about 57 µs (because these do nothing else).
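
Spelling that out from the two rows above (total time divided by call count):

  t_{\mathrm{VecAXPY}} / 400 = 2.3017\times 10^{-2}\,\mathrm{s} / 400 \approx 57.5\,\mu\mathrm{s}
  t_{\mathrm{VecAYPX}} / 199 = 1.1312\times 10^{-2}\,\mathrm{s} / 199 \approx 56.8\,\mu\mathrm{s}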

> VecPointwiseMult 201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  1  0  0  0   1  1  0  0  0 235882   290088  0 0.00e+000 
> 0.00e+00 100
> VecScatterBegin  200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04 
> 1.0e+00  2  0 99 79  0  19  0100100  0 0   0  1 2.04e-040 
> 0.00e+00  0
> VecScatterEnd200 1.0 1.9007e-02 3.7 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   1  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0

I'm curious how these change with problem size. (To what extent are we latency 
vs bandwidth limited?)
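
One way to frame that: model each operation as t(m) \approx t_0 + m/B, with t_0 the
per-kernel (or per-message) latency and B the achievable bandwidth, both to be fit
from runs at several sizes. The crossover where bandwidth starts to dominate is
m \approx t_0 B; taking t_0 \approx 57\,\mu\mathrm{s} from the VecAXPY numbers above
and B on the order of 1 TB/s per GCD (a round placeholder, not a measurement), that
is tens of MB, i.e. a few million doubles, per kernel, which the present runs at
roughly a quarter million equations per GPU do not reach.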

> SFSetUp1 1.0 1.3015e-03 1.3 0.00e+00 0.0 1.1e+02 1.7e+04 
> 1.0e+00  0  0  1  0  0   0  0  1  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> SFPack   200 1.0 1.7309e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  2  0  0  0  0  18  0  0  0  0 0   0  1 2.04e-040 
> 0.00e+00  0
> SFUnpack 200 1.0 2.3165e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Mark Adams
adding Suyash,

I found the/a problem. Using ex56, which has a crappy decomposition, using
one MPI process/GPU is much faster than using 8 (64 total). (I am looking
at ex13 to see how much of this is due to the decomposition)
If you only use 8 processes it seems that all 8 are put on the first GPU,
but adding -c8 seems to fix this.
Now the numbers are looking reasonable.

On Mon, Jan 24, 2022 at 3:24 PM Barry Smith  wrote:

>
>   For this, to start, someone can run
>
> src/vec/vec/tutorials/performance.c
>
> and compare the performance to that in the technical report Evaluation of
> PETSc on a Heterogeneous Architecture \\ the OLCF Summit System \\ Part I:
> Vector Node Performance. Google to find. One does not have to and shouldn't
> do an extensive study right now that compares everything, instead one
> should run a very small number of different size problems (make them big)
> and compare those sizes with what Summit gives. Note you will need to make
> sure that performance.c uses the Kokkos backend.
>
>   One hopes for better performance than Summit; if one gets tons worse we
> know something is very wrong somewhere. I'd love to see some comparisons.
>
>   Barry
>
>
> On Jan 24, 2022, at 3:06 PM, Justin Chang  wrote:
>
> Also, do you guys have an OLCF liaison? That's actually your better bet if
> you do.
>
> Performance issues with ROCm/Kokkos are pretty common in apps besides just
> PETSc. We have several teams actively working on rectifying this. However,
> I think performance issues can be quicker to identify if we had a more
> "official" and reproducible PETSc GPU benchmark, which I've already
> expressed to some folks in this thread, and as others already commented on
> the difficulty of such a task. Hopefully I will have more time soon to
> illustrate what I am thinking.
>
> On Mon, Jan 24, 2022 at 1:57 PM Justin Chang  wrote:
>
>> My name has been called.
>>
>> Mark, if you're having issues with Crusher, please contact Veronica
>> Vergara (vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in
>> those emails
>>
>> On Mon, Jan 24, 2022 at 1:49 PM Barry Smith  wrote:
>>
>>>
>>>
>>> On Jan 24, 2022, at 2:46 PM, Mark Adams  wrote:
>>>
>>> Yea, CG/Jacobi is as close to a benchmark code as we could want. I could
>>> run this on one processor to get cleaner numbers.
>>>
>>> Is there a designated ECP technical support contact?
>>>
>>>
>>>Mark, you've forgotten you work for DOE. There isn't a non-ECP
>>> technical support contact.
>>>
>>>But if this is an AMD machine then maybe contact Matt's student
>>> Justin Chang?
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Jan 24, 2022 at 2:18 PM Barry Smith  wrote:
>>>

   I think you should contact the crusher ECP technical support team and
 tell them you are getting dismal performance and ask if you should expect
 better. Don't waste time flogging a dead horse.

 On Jan 24, 2022, at 2:16 PM, Matthew Knepley  wrote:

 On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang 
 wrote:

>
>
> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams  wrote:
>
>>
>>
>> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang <
>> junchao.zh...@gmail.com> wrote:
>>
>>> Mark, I think you can benchmark individual vector operations, and
>>> once we get reasonable profiling results, we can move to solvers etc.
>>>
>>
>> Can you suggest a code to run or are you suggesting making a vector
>> benchmark code?
>>
> Make a vector benchmark code, testing vector operations that would be
> used in your solver.
> Also, we can run MatMult() to see if the profiling result is
> reasonable.
> Only once we get some solid results on basic operations, it is useful
> to run big codes.
>

 So we have to make another throw-away code? Why not just look at the
 vector ops in Mark's actual code?

Matt


>
>>
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams  wrote:
>>>


 On Mon, Jan 24, 2022 at 12:44 PM Barry Smith 
 wrote:

>
>   Here except for VecNorm the GPU is used effectively in that most
> of the time is spent doing real work on the GPU
>
> VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00
> 0.0e+00 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0
> 0.00e+000 0.00e+00 100
>
> Even the dots are very effective, only the VecNorm flop rate over
> the full time is much much lower than the vecdot. Which is somehow 
> due to
> the use of the GPU or CPU MPI in the allreduce?
>

 The VecNorm GPU rate is relatively high on Crusher and the CPU rate
 is about the same as the other vec ops. I don't know what to make of 
 that.

 But Crusher is clearly not crushing it.


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Barry Smith

  Sure, this is definitely not for the public; it is just numbers one can give
to OLCF, AMD, and Kokkos to ensure things are working as they should be.


> On Jan 24, 2022, at 3:30 PM, Munson, Todd  wrote:
> 
> I want to note that crusher is early access hardware, so we should expect 
> performance to not be great right now.  Doing what we can to help identify 
> the performance issues and keeping OLCF informed would be the best.
>  
> Note that we cannot make any of the preliminary results publicly available 
> without explicit permission from OLCF; all of the results have to be 
> considered preliminary and the software stack will undergo a rapid churn.
>  
> All the best, Todd.
>  
> From: petsc-dev <petsc-dev-boun...@mcs.anl.gov> on behalf of Barry Smith <bsm...@petsc.dev>
> Date: Monday, January 24, 2022 at 2:24 PM
> To: Justin Chang <jychan...@gmail.com>
> Cc: "petsc-dev@mcs.anl.gov" <petsc-dev@mcs.anl.gov>
> Subject: Re: [petsc-dev] Kokkos/Crusher perforance
>  
>  
>   For this, to start, someone can run 
>  
> src/vec/vec/tutorials/performance.c 
> 
> 
> and compare the performance to that in the technical report Evaluation of 
> PETSc on a Heterogeneous Architecture \\ the OLCF Summit System \\ Part I: 
> Vector Node Performance. Google to find. One does not have to and shouldn't 
> do an extensive study right now that compares everything, instead one should 
> run a very small number of different size problems (make them big) and 
> compare those sizes with what Summit gives. Note you will need to make sure 
> that performance.c uses the Kokkos backend.
>  
>   One hopes for better performance than Summit; if one gets tons worse we 
> know something is very wrong somewhere. I'd love to see some comparisons.
>  
>   Barry
>  
> 
> 
>> On Jan 24, 2022, at 3:06 PM, Justin Chang <jychan...@gmail.com> wrote:
>>  
>> Also, do you guys have an OLCF liaison? That's actually your better bet if 
>> you do. 
>> 
>> Performance issues with ROCm/Kokkos are pretty common in apps besides just 
>> PETSc. We have several teams actively working on rectifying this. However, I 
>> think performance issues can be quicker to identify if we had a more 
>> "official" and reproducible PETSc GPU benchmark, which I've already 
>> expressed to some folks in this thread, and as others already commented on 
>> the difficulty of such a task. Hopefully I will have more time soon to 
>> illustrate what I am thinking.
>>  
>> On Mon, Jan 24, 2022 at 1:57 PM Justin Chang <jychan...@gmail.com> wrote:
>>> My name has been called.
>>>  
>>> Mark, if you're having issues with Crusher, please contact Veronica Vergara 
>>> (vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in those emails
>>>  
>>> On Mon, Jan 24, 2022 at 1:49 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>  
>>>> 
>>>> 
>>>>> On Jan 24, 2022, at 2:46 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>  
>>>>> Yea, CG/Jacobi is as close to a benchmark code as we could want. I could 
>>>>> run this on one processor to get cleaner numbers.
>>>>>  
>>>>> Is there a designated ECP technical support contact?
>>>>  
>>>>Mark, you've forgotten you work for DOE. There isn't a non-ECP 
>>>> technical support contact. 
>>>>  
>>>>But if this is an AMD machine then maybe contact Matt's student Justin 
>>>> Chang?
>>>>  
>>>>  
>>>> 
>>>> 
>>>>>  
>>>>>  
>>>>> On Mon, Jan 24, 2022 at 2:18 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>  
>>>>>>   I think you should contact the crusher ECP technical support team and 
>>>>>> tell them you are getting dismal performance and ask if you should 
>>>>>> expect better. Don't waste time flogging a dead horse. 
>>>>>> 
>>>>>> 
>>>>>>> On Jan 24, 2022, at 2:16 PM, Matthew Knepley <knep...@gmail.com> wrote:
>>>>>>>  
>>>>>>> On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Munson, Todd via petsc-dev
I want to note that crusher is early access hardware, so we should expect 
performance to not be great right now.  Doing what we can to help identify the 
performance issues and keeping OLCF informed would be the best.

Note that we cannot make any of the preliminary results publicly available 
without explicit permission from OLCF; all of the results have to be considered 
preliminary and the software stack will undergo a rapid churn.

All the best, Todd.

From: petsc-dev  on behalf of Barry Smith 

Date: Monday, January 24, 2022 at 2:24 PM
To: Justin Chang 
Cc: "petsc-dev@mcs.anl.gov" 
Subject: Re: [petsc-dev] Kokkos/Crusher perforance


  For this, to start, someone can run

src/vec/vec/tutorials/performance.c


and compare the performance to that in the technical report Evaluation of PETSc 
on a Heterogeneous Architecture \\ the OLCF Summit System \\ Part I: Vector 
Node Performance. Google to find. One does not have to and shouldn't do an 
extensive study right now that compares everything, instead one should run a 
very small number of different size problems (make them big) and compare those 
sizes with what Summit gives. Note you will need to make sure that 
performance.c uses the Kokkos backend.

  One hopes for better performance than Summit; if one gets tons worse we know 
something is very wrong somewhere. I'd love to see some comparisons.

  Barry



On Jan 24, 2022, at 3:06 PM, Justin Chang <jychan...@gmail.com> wrote:

Also, do you guys have an OLCF liaison? That's actually your better bet if you 
do.

Performance issues with ROCm/Kokkos are pretty common in apps besides just 
PETSc. We have several teams actively working on rectifying this. However, I 
think performance issues can be quicker to identify if we had a more "official" 
and reproducible PETSc GPU benchmark, which I've already expressed to some 
folks in this thread, and as others already commented on the difficulty of such 
a task. Hopefully I will have more time soon to illustrate what I am thinking.

On Mon, Jan 24, 2022 at 1:57 PM Justin Chang <jychan...@gmail.com> wrote:
My name has been called.

Mark, if you're having issues with Crusher, please contact Veronica Vergara
(vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in those emails

On Mon, Jan 24, 2022 at 1:49 PM Barry Smith <bsm...@petsc.dev> wrote:



On Jan 24, 2022, at 2:46 PM, Mark Adams <mfad...@lbl.gov> wrote:

Yea, CG/Jacobi is as close to a benchmark code as we could want. I could run 
this on one processor to get cleaner numbers.

Is there a designated ECP technical support contact?

   Mark, you've forgotten you work for DOE. There isn't a non-ECP technical 
support contact.

   But if this is an AMD machine then maybe contact Matt's student Justin Chang?






On Mon, Jan 24, 2022 at 2:18 PM Barry Smith <bsm...@petsc.dev> wrote:

  I think you should contact the crusher ECP technical support team and tell 
them you are getting dismal performance and ask if you should expect better. 
Don't waste time flogging a dead horse.


On Jan 24, 2022, at 2:16 PM, Matthew Knepley <knep...@gmail.com> wrote:

On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:


On Mon, Jan 24, 2022 at 12:55 PM Mark Adams <mfad...@lbl.gov> wrote:


On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
Mark, I think you can benchmark individual vector operations, and once we get 
reasonable profiling results, we can move to solvers etc.

Can you suggest a code to run or are you suggesting making a vector benchmark 
code?
Make a vector benchmark code, testing vector operations that would be used in 
your solver.
Also, we can run MatMult() to see if the profiling result is reasonable.
Only once we get some solid results on basic operations, it is useful to run 
big codes.

So we have to make another throw-away code? Why not just look at the vector ops 
in Mark's actual code?

   Matt



--Junchao Zhang


On Mon, Jan 24, 2022 at 12:09 PM Mark Adams <mfad...@lbl.gov> wrote:


On Mon, Jan 24, 2022 at 12:44 PM Barry Smith <bsm...@petsc.dev> wrote:

  Here except for VecNorm the GPU is used effectively in that most of the time 
is spent doing real work on the GPU

VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 
4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0 0.00e+000 
0.00e+00 100

Even the dots are very effective, only the VecNorm flop rate over the full time 
is much much lower than the vecdot. Which is somehow due to the use of the GPU 
or CPU MPI in the allreduce?

The VecNorm GPU rate is relatively high on Crusher and the CPU rate is about 
the same as the other vec ops. I don't know what to make of that.

But Crusher is clearly not crushing it.

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Barry Smith

  For this, to start, someone can run 

src/vec/vec/tutorials/performance.c 

and compare the performance to that in the technical report Evaluation of PETSc 
on a Heterogeneous Architecture \\ the OLCF Summit System \\ Part I: Vector 
Node Performance. Google to find. One does not have to and shouldn't do an 
extensive study right now that compares everything, instead one should run a 
very small number of different size problems (make them big) and compare those 
sizes with what Summit gives. Note you will need to make sure that 
performance.c uses the Kokkos backend.
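
A concrete sketch of the kind of run I mean, assuming performance.c picks the
vector type up from the options database (check the source for the exact options
it accepts; everything below is illustrative only):

  cd $PETSC_DIR/src/vec/vec/tutorials
  make performance
  srun -n 8 -N 1 ./performance -vec_type kokkos -log_view

and then compare the reported VecAXPY/VecDot/VecNorm rates for a few large
vector sizes against the corresponding tables in the Summit report.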

  One hopes for better performance than Summit; if one gets tons worse we know 
something is very wrong somewhere. I'd love to see some comparisons.

  Barry


> On Jan 24, 2022, at 3:06 PM, Justin Chang  wrote:
> 
> Also, do you guys have an OLCF liaison? That's actually your better bet if 
> you do. 
> 
> Performance issues with ROCm/Kokkos are pretty common in apps besides just 
> PETSc. We have several teams actively working on rectifying this. However, I 
> think performance issues can be quicker to identify if we had a more 
> "official" and reproducible PETSc GPU benchmark, which I've already expressed 
> to some folks in this thread, and as others already commented on the 
> difficulty of such a task. Hopefully I will have more time soon to illustrate 
> what I am thinking.
> 
> On Mon, Jan 24, 2022 at 1:57 PM Justin Chang  > wrote:
> My name has been called.
> 
> Mark, if you're having issues with Crusher, please contact Veronica Vergara 
> (vergar...@ornl.gov ). You can cc me 
> (justin.ch...@amd.com ) in those emails
> 
> On Mon, Jan 24, 2022 at 1:49 PM Barry Smith  > wrote:
> 
> 
>> On Jan 24, 2022, at 2:46 PM, Mark Adams > > wrote:
>> 
>> Yea, CG/Jacobi is as close to a benchmark code as we could want. I could run 
>> this on one processor to get cleaner numbers.
>> 
>> Is there a designated ECP technical support contact?
> 
>Mark, you've forgotten you work for DOE. There isn't a non-ECP technical 
> support contact. 
> 
>But if this is an AMD machine then maybe contact Matt's student Justin 
> Chang?
> 
> 
> 
>> 
>> 
>> On Mon, Jan 24, 2022 at 2:18 PM Barry Smith > > wrote:
>> 
>>   I think you should contact the crusher ECP technical support team and tell 
>> them you are getting dismel performance and ask if you should expect better. 
>> Don't waste time flogging a dead horse. 
>> 
>>> On Jan 24, 2022, at 2:16 PM, Matthew Knepley >> > wrote:
>>> 
>>> On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang >> > wrote:
>>> 
>>> 
>>> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams >> > wrote:
>>> 
>>> 
>>> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang >> > wrote:
>>> Mark, I think you can benchmark individual vector operations, and once we 
>>> get reasonable profiling results, we can move to solvers etc.
>>> 
>>> Can you suggest a code to run or are you suggesting making a vector 
>>> benchmark code?
>>> Make a vector benchmark code, testing vector operations that would be used 
>>> in your solver.
>>> Also, we can run MatMult() to see if the profiling result is reasonable.
>>> Only once we get some solid results on basic operations, it is useful to 
>>> run big codes.
>>> 
>>> So we have to make another throw-away code? Why not just look at the vector 
>>> ops in Mark's actual code?
>>> 
>>>Matt
>>>  
>>>  
>>> 
>>> --Junchao Zhang
>>> 
>>> 
>>> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams >> > wrote:
>>> 
>>> 
>>> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith >> > wrote:
>>> 
>>>   Here except for VecNorm the GPU is used effectively in that most of the 
>>> time is time is spent doing real work on the GPU
>>> 
>>> VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 
>>> 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0 0.00e+000 
>>> 0.00e+00 100
>>> 
>>> Even the dots are very effective, only the VecNorm flop rate over the full 
>>> time is much much lower than the vecdot. Which is somehow due to the use of 
>>> the GPU or CPU MPI in the allreduce?
>>> 
>>> The VecNorm GPU rate is relatively high on Crusher and the CPU rate is 
>>> about the same as the other vec ops. I don't know what to make of that.
>>> 
>>> But Crusher is clearly not crushing it. 
>>> 
>>> Junchao: Perhaps we should ask Kokkos if they have any experience with 
>>> Crusher that they can share. They could very well find some low level magic.
>>> 
>>> 
>>> 
>>> 
>>> 
 On Jan 24, 2022, at 12:14 PM, Mark Adams >>> > wrote:
 
 
 
 Mark, can we compare with Spock?
 
  Looks much better. This puts two processes/GPU because there are only 4.
 
>>> 
>>> 
>>> 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Mark Adams
On Mon, Jan 24, 2022 at 2:57 PM Justin Chang  wrote:

> My name has been called.
>
> Mark, if you're having issues with Crusher, please contact Veronica
> Vergara (vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in
> those emails
>

I have worked with Veronica before.
I'll ask Todd if we have an OLCF liaison. He is checking.


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Justin Chang
Also, do you guys have an OLCF liaison? That's actually your better bet if
you do.

Performance issues with ROCm/Kokkos are pretty common in apps besides just
PETSc. We have several teams actively working on rectifying this. However,
I think performance issues can be quicker to identify if we had a more
"official" and reproducible PETSc GPU benchmark, which I've already
expressed to some folks in this thread, and as others already commented on
the difficulty of such a task. Hopefully I will have more time soon to
illustrate what I am thinking.

On Mon, Jan 24, 2022 at 1:57 PM Justin Chang  wrote:

> My name has been called.
>
> Mark, if you're having issues with Crusher, please contact Veronica
> Vergara (vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in
> those emails
>
> On Mon, Jan 24, 2022 at 1:49 PM Barry Smith  wrote:
>
>>
>>
>> On Jan 24, 2022, at 2:46 PM, Mark Adams  wrote:
>>
>> Yea, CG/Jacobi is as close to a benchmark code as we could want. I could
>> run this on one processor to get cleaner numbers.
>>
>> Is there a designated ECP technical support contact?
>>
>>
>>Mark, you've forgotten you work for DOE. There isn't a non-ECP
>> technical support contact.
>>
>>But if this is an AMD machine then maybe contact Matt's student Justin
>> Chang?
>>
>>
>>
>>
>>
>> On Mon, Jan 24, 2022 at 2:18 PM Barry Smith  wrote:
>>
>>>
>>>   I think you should contact the crusher ECP technical support team and
>>> tell them you are getting dismel performance and ask if you should expect
>>> better. Don't waste time flogging a dead horse.
>>>
>>> On Jan 24, 2022, at 2:16 PM, Matthew Knepley  wrote:
>>>
>>> On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang 
>>> wrote:
>>>


 On Mon, Jan 24, 2022 at 12:55 PM Mark Adams  wrote:

>
>
> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang 
> wrote:
>
>> Mark, I think you can benchmark individual vector operations, and
>> once we get reasonable profiling results, we can move to solvers etc.
>>
>
> Can you suggest a code to run or are you suggesting making a vector
> benchmark code?
>
 Make a vector benchmark code, testing vector operations that would be
 used in your solver.
 Also, we can run MatMult() to see if the profiling result is reasonable.
 Only once we get some solid results on basic operations, it is useful
 to run big codes.

>>>
>>> So we have to make another throw-away code? Why not just look at the
>>> vector ops in Mark's actual code?
>>>
>>>Matt
>>>
>>>

>
>>
>> --Junchao Zhang
>>
>>
>> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams  wrote:
>>
>>>
>>>
>>> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith 
>>> wrote:
>>>

   Here except for VecNorm the GPU is used effectively in that most
 of the time is time is spent doing real work on the GPU

 VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00
 0.0e+00 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0
 0.00e+000 0.00e+00 100

 Even the dots are very effective, only the VecNorm flop rate over
 the full time is much much lower than the vecdot. Which is somehow due 
 to
 the use of the GPU or CPU MPI in the allreduce?

>>>
>>> The VecNorm GPU rate is relatively high on Crusher and the CPU rate
>>> is about the same as the other vec ops. I don't know what to make of 
>>> that.
>>>
>>> But Crusher is clearly not crushing it.
>>>
>>> Junchao: Perhaps we should ask Kokkos if they have any experience
>>> with Crusher that they can share. They could very well find some low 
>>> level
>>> magic.
>>>
>>>
>>>


 On Jan 24, 2022, at 12:14 PM, Mark Adams  wrote:



> Mark, can we compare with Spock?
>

  Looks much better. This puts two processes/GPU because there are
 only 4.
 



>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>> 
>>>
>>>
>>>
>>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Justin Chang
My name has been called.

Mark, if you're having issues with Crusher, please contact Veronica Vergara
(vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in those emails

On Mon, Jan 24, 2022 at 1:49 PM Barry Smith  wrote:

>
>
> On Jan 24, 2022, at 2:46 PM, Mark Adams  wrote:
>
> Yea, CG/Jacobi is as close to a benchmark code as we could want. I could
> run this on one processor to get cleaner numbers.
>
> Is there a designated ECP technical support contact?
>
>
>Mark, you've forgotten you work for DOE. There isn't a non-ECP
> technical support contact.
>
>But if this is an AMD machine then maybe contact Matt's student Justin
> Chang?
>
>
>
>
>
> On Mon, Jan 24, 2022 at 2:18 PM Barry Smith  wrote:
>
>>
>>   I think you should contact the crusher ECP technical support team and
>> tell them you are getting dismel performance and ask if you should expect
>> better. Don't waste time flogging a dead horse.
>>
>> On Jan 24, 2022, at 2:16 PM, Matthew Knepley  wrote:
>>
>> On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang 
>> wrote:
>>
>>>
>>>
>>> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams  wrote:
>>>


 On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang 
 wrote:

> Mark, I think you can benchmark individual vector operations, and once
> we get reasonable profiling results, we can move to solvers etc.
>

 Can you suggest a code to run or are you suggesting making a vector
 benchmark code?

>>> Make a vector benchmark code, testing vector operations that would be
>>> used in your solver.
>>> Also, we can run MatMult() to see if the profiling result is reasonable.
>>> Only once we get some solid results on basic operations, it is useful to
>>> run big codes.
>>>
>>
>> So we have to make another throw-away code? Why not just look at the
>> vector ops in Mark's actual code?
>>
>>Matt
>>
>>
>>>

>
> --Junchao Zhang
>
>
> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams  wrote:
>
>>
>>
>> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith 
>> wrote:
>>
>>>
>>>   Here except for VecNorm the GPU is used effectively in that most
>>> of the time is time is spent doing real work on the GPU
>>>
>>> VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00
>>> 0.0e+00 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0
>>> 0.00e+000 0.00e+00 100
>>>
>>> Even the dots are very effective, only the VecNorm flop rate over
>>> the full time is much much lower than the vecdot. Which is somehow due 
>>> to
>>> the use of the GPU or CPU MPI in the allreduce?
>>>
>>
>> The VecNorm GPU rate is relatively high on Crusher and the CPU rate
>> is about the same as the other vec ops. I don't know what to make of 
>> that.
>>
>> But Crusher is clearly not crushing it.
>>
>> Junchao: Perhaps we should ask Kokkos if they have any experience
>> with Crusher that they can share. They could very well find some low 
>> level
>> magic.
>>
>>
>>
>>>
>>>
>>> On Jan 24, 2022, at 12:14 PM, Mark Adams  wrote:
>>>
>>>
>>>
 Mark, can we compare with Spock?

>>>
>>>  Looks much better. This puts two processes/GPU because there are
>>> only 4.
>>> 
>>>
>>>
>>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> 
>>
>>
>>
>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Barry Smith


> On Jan 24, 2022, at 2:46 PM, Mark Adams  wrote:
> 
> Yea, CG/Jacobi is as close to a benchmark code as we could want. I could run 
> this on one processor to get cleaner numbers.
> 
> Is there a designated ECP technical support contact?

   Mark, you've forgotten you work for DOE. There isn't a non-ECP technical 
support contact. 

   But if this is an AMD machine then maybe contact Matt's student Justin Chang?



> 
> 
> On Mon, Jan 24, 2022 at 2:18 PM Barry Smith  > wrote:
> 
>   I think you should contact the crusher ECP technical support team and tell 
> them you are getting dismel performance and ask if you should expect better. 
> Don't waste time flogging a dead horse. 
> 
>> On Jan 24, 2022, at 2:16 PM, Matthew Knepley > > wrote:
>> 
>> On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang > > wrote:
>> 
>> 
>> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams > > wrote:
>> 
>> 
>> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang > > wrote:
>> Mark, I think you can benchmark individual vector operations, and once we 
>> get reasonable profiling results, we can move to solvers etc.
>> 
>> Can you suggest a code to run or are you suggesting making a vector 
>> benchmark code?
>> Make a vector benchmark code, testing vector operations that would be used 
>> in your solver.
>> Also, we can run MatMult() to see if the profiling result is reasonable.
>> Only once we get some solid results on basic operations, it is useful to run 
>> big codes.
>> 
>> So we have to make another throw-away code? Why not just look at the vector 
>> ops in Mark's actual code?
>> 
>>Matt
>>  
>>  
>> 
>> --Junchao Zhang
>> 
>> 
>> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams > > wrote:
>> 
>> 
>> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith > > wrote:
>> 
>>   Here except for VecNorm the GPU is used effectively in that most of the 
>> time is time is spent doing real work on the GPU
>> 
>> VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 
>> 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0 0.00e+000 
>> 0.00e+00 100
>> 
>> Even the dots are very effective, only the VecNorm flop rate over the full 
>> time is much much lower than the vecdot. Which is somehow due to the use of 
>> the GPU or CPU MPI in the allreduce?
>> 
>> The VecNorm GPU rate is relatively high on Crusher and the CPU rate is about 
>> the same as the other vec ops. I don't know what to make of that.
>> 
>> But Crusher is clearly not crushing it. 
>> 
>> Junchao: Perhaps we should ask Kokkos if they have any experience with 
>> Crusher that they can share. They could very well find some low level magic.
>> 
>> 
>> 
>> 
>> 
>>> On Jan 24, 2022, at 12:14 PM, Mark Adams >> > wrote:
>>> 
>>> 
>>> 
>>> Mark, can we compare with Spock?
>>> 
>>>  Looks much better. This puts two processes/GPU because there are only 4.
>>> 
>> 
>> 
>> 
>> -- 
>> What most experimenters take for granted before they begin their experiments 
>> is infinitely more interesting than any results to which their experiments 
>> lead.
>> -- Norbert Wiener
>> 
>> https://www.cse.buffalo.edu/~knepley/ 
> 



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Mark Adams
Yea, CG/Jacobi is as close to a benchmark code as we could want. I could
run this on one processor to get cleaner numbers.
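
For reference, a sketch of what I mean by the one-processor run, reusing the
ex13 options from the runs elsewhere in this thread but with a smaller box so it
fits on one GPU (the sizes here are just illustrative):

  srun -n1 -N1 ../ex13 -dm_plex_box_faces 2,2,2 -dm_refine 5 -dm_view \
    -pc_type jacobi -dm_mat_type aijkokkos -dm_vec_type kokkos -ksp_view -log_view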

Is there a designated ECP technical support contact?


On Mon, Jan 24, 2022 at 2:18 PM Barry Smith  wrote:

>
>   I think you should contact the crusher ECP technical support team and
> tell them you are getting dismel performance and ask if you should expect
> better. Don't waste time flogging a dead horse.
>
> On Jan 24, 2022, at 2:16 PM, Matthew Knepley  wrote:
>
> On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang 
> wrote:
>
>>
>>
>> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams  wrote:
>>
>>>
>>>
>>> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang 
>>> wrote:
>>>
 Mark, I think you can benchmark individual vector operations, and once
 we get reasonable profiling results, we can move to solvers etc.

>>>
>>> Can you suggest a code to run or are you suggesting making a vector
>>> benchmark code?
>>>
>> Make a vector benchmark code, testing vector operations that would be
>> used in your solver.
>> Also, we can run MatMult() to see if the profiling result is reasonable.
>> Only once we get some solid results on basic operations, it is useful to
>> run big codes.
>>
>
> So we have to make another throw-away code? Why not just look at the
> vector ops in Mark's actual code?
>
>Matt
>
>
>>
>>>

 --Junchao Zhang


 On Mon, Jan 24, 2022 at 12:09 PM Mark Adams  wrote:

>
>
> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith  wrote:
>
>>
>>   Here except for VecNorm the GPU is used effectively in that most of
>> the time is time is spent doing real work on the GPU
>>
>> VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00
>> 0.0e+00 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0
>> 0.00e+000 0.00e+00 100
>>
>> Even the dots are very effective, only the VecNorm flop rate over the
>> full time is much much lower than the vecdot. Which is somehow due to the
>> use of the GPU or CPU MPI in the allreduce?
>>
>
> The VecNorm GPU rate is relatively high on Crusher and the CPU rate is
> about the same as the other vec ops. I don't know what to make of that.
>
> But Crusher is clearly not crushing it.
>
> Junchao: Perhaps we should ask Kokkos if they have any experience with
> Crusher that they can share. They could very well find some low level 
> magic.
>
>
>
>>
>>
>> On Jan 24, 2022, at 12:14 PM, Mark Adams  wrote:
>>
>>
>>
>>> Mark, can we compare with Spock?
>>>
>>
>>  Looks much better. This puts two processes/GPU because there are
>> only 4.
>> 
>>
>>
>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> 
>
>
>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Barry Smith

  I think you should contact the crusher ECP technical support team and tell 
them you are getting dismal performance and ask if you should expect better. 
Don't waste time flogging a dead horse. 

> On Jan 24, 2022, at 2:16 PM, Matthew Knepley  wrote:
> 
> On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang  > wrote:
> 
> 
> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams  > wrote:
> 
> 
> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang  > wrote:
> Mark, I think you can benchmark individual vector operations, and once we get 
> reasonable profiling results, we can move to solvers etc.
> 
> Can you suggest a code to run or are you suggesting making a vector benchmark 
> code?
> Make a vector benchmark code, testing vector operations that would be used in 
> your solver.
> Also, we can run MatMult() to see if the profiling result is reasonable.
> Only once we get some solid results on basic operations, it is useful to run 
> big codes.
> 
> So we have to make another throw-away code? Why not just look at the vector 
> ops in Mark's actual code?
> 
>Matt
>  
>  
> 
> --Junchao Zhang
> 
> 
> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams  > wrote:
> 
> 
> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith  > wrote:
> 
>   Here except for VecNorm the GPU is used effectively in that most of the 
> time is time is spent doing real work on the GPU
> 
> VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 
> 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0 0.00e+000 
> 0.00e+00 100
> 
> Even the dots are very effective, only the VecNorm flop rate over the full 
> time is much much lower than the vecdot. Which is somehow due to the use of 
> the GPU or CPU MPI in the allreduce?
> 
> The VecNorm GPU rate is relatively high on Crusher and the CPU rate is about 
> the same as the other vec ops. I don't know what to make of that.
> 
> But Crusher is clearly not crushing it. 
> 
> Junchao: Perhaps we should ask Kokkos if they have any experience with 
> Crusher that they can share. They could very well find some low level magic.
> 
> 
> 
> 
> 
>> On Jan 24, 2022, at 12:14 PM, Mark Adams > > wrote:
>> 
>> 
>> 
>> Mark, can we compare with Spock?
>> 
>>  Looks much better. This puts two processes/GPU because there are only 4.
>> 
> 
> 
> 
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/ 



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Matthew Knepley
On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang 
wrote:

>
>
> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams  wrote:
>
>>
>>
>> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang 
>> wrote:
>>
>>> Mark, I think you can benchmark individual vector operations, and once
>>> we get reasonable profiling results, we can move to solvers etc.
>>>
>>
>> Can you suggest a code to run or are you suggesting making a vector
>> benchmark code?
>>
> Make a vector benchmark code, testing vector operations that would be used
> in your solver.
> Also, we can run MatMult() to see if the profiling result is reasonable.
> Only once we get some solid results on basic operations, it is useful to
> run big codes.
>

So we have to make another throw-away code? Why not just look at the vector
ops in Mark's actual code?

   Matt


>
>>
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams  wrote:
>>>


 On Mon, Jan 24, 2022 at 12:44 PM Barry Smith  wrote:

>
>   Here except for VecNorm the GPU is used effectively in that most of
> the time is time is spent doing real work on the GPU
>
> VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00
> 0.0e+00 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0
> 0.00e+000 0.00e+00 100
>
> Even the dots are very effective, only the VecNorm flop rate over the
> full time is much much lower than the vecdot. Which is somehow due to the
> use of the GPU or CPU MPI in the allreduce?
>

 The VecNorm GPU rate is relatively high on Crusher and the CPU rate is
 about the same as the other vec ops. I don't know what to make of that.

 But Crusher is clearly not crushing it.

 Junchao: Perhaps we should ask Kokkos if they have any experience with
 Crusher that they can share. They could very well find some low level 
 magic.



>
>
> On Jan 24, 2022, at 12:14 PM, Mark Adams  wrote:
>
>
>
>> Mark, can we compare with Spock?
>>
>
>  Looks much better. This puts two processes/GPU because there are only
> 4.
> 
>
>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Junchao Zhang
On Mon, Jan 24, 2022 at 12:55 PM Mark Adams  wrote:

>
>
> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang 
> wrote:
>
>> Mark, I think you can benchmark individual vector operations, and once we
>> get reasonable profiling results, we can move to solvers etc.
>>
>
> Can you suggest a code to run or are you suggesting making a vector
> benchmark code?
>
Make a vector benchmark code, testing vector operations that would be used
in your solver.
Also, we can run MatMult() to see if the profiling result is reasonable.
Only once we get some solid results on basic operations, it is useful to
run big codes.
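
A rough sketch of what such a throwaway benchmark could look like; this is only
an illustration, not an official harness. The -n option is specific to this
sketch, and the backend is chosen at run time with -vec_type kokkos:

static char help[] = "Sketch of a vector-op benchmark.\n";

#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            x, y, w;
  PetscInt       i, n = 2097152, nits = 400;
  PetscScalar    dot;
  PetscReal      nrm;
  PetscLogStage  stage;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, help);if (ierr) return ierr;
  ierr = PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL);CHKERRQ(ierr);
  ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
  ierr = VecSetSizes(x, PETSC_DECIDE, n);CHKERRQ(ierr);
  ierr = VecSetFromOptions(x);CHKERRQ(ierr);     /* -vec_type kokkos selects the Kokkos backend */
  ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
  ierr = VecDuplicate(x, &w);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);
  ierr = VecSet(y, 2.0);CHKERRQ(ierr);
  ierr = VecNorm(x, NORM_2, &nrm);CHKERRQ(ierr); /* warm-up: triggers the first host-to-device copy */

  ierr = PetscLogStageRegister("VecOps", &stage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  for (i = 0; i < nits; i++) {                   /* the operations a CG/Jacobi solve actually uses */
    ierr = VecAXPY(y, 1.0, x);CHKERRQ(ierr);
    ierr = VecAYPX(y, 1.0, x);CHKERRQ(ierr);
    ierr = VecPointwiseMult(w, x, y);CHKERRQ(ierr);
    ierr = VecTDot(x, y, &dot);CHKERRQ(ierr);
    ierr = VecNorm(x, NORM_2, &nrm);CHKERRQ(ierr);
  }
  ierr = PetscLogStagePop();CHKERRQ(ierr);

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = VecDestroy(&w);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Run with something like -n 2097152 -vec_type kokkos -log_view and compare the
"VecOps" stage across Crusher, Spock, and Summit; that size roughly matches the
per-rank problem in the runs in this thread.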


>
>
>>
>> --Junchao Zhang
>>
>>
>> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams  wrote:
>>
>>>
>>>
>>> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith  wrote:
>>>

   Here except for VecNorm the GPU is used effectively in that most of
 the time is time is spent doing real work on the GPU

 VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00
 0.0e+00 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0
 0.00e+000 0.00e+00 100

 Even the dots are very effective, only the VecNorm flop rate over the
 full time is much much lower than the vecdot. Which is somehow due to the
 use of the GPU or CPU MPI in the allreduce?

>>>
>>> The VecNorm GPU rate is relatively high on Crusher and the CPU rate is
>>> about the same as the other vec ops. I don't know what to make of that.
>>>
>>> But Crusher is clearly not crushing it.
>>>
>>> Junchao: Perhaps we should ask Kokkos if they have any experience with
>>> Crusher that they can share. They could very well find some low level magic.
>>>
>>>
>>>


 On Jan 24, 2022, at 12:14 PM, Mark Adams  wrote:



> Mark, can we compare with Spock?
>

  Looks much better. This puts two processes/GPU because there are only
 4.
 





Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Mark Adams
On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang 
wrote:

> Mark, I think you can benchmark individual vector operations, and once we
> get reasonable profiling results, we can move to solvers etc.
>

Can you suggest a code to run or are you suggesting making a vector
benchmark code?


>
> --Junchao Zhang
>
>
> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams  wrote:
>
>>
>>
>> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith  wrote:
>>
>>>
>>>   Here except for VecNorm the GPU is used effectively in that most of
>>> the time is time is spent doing real work on the GPU
>>>
>>> VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00
>>> 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0 0.00e+000
>>> 0.00e+00 100
>>>
>>> Even the dots are very effective, only the VecNorm flop rate over the
>>> full time is much much lower than the vecdot. Which is somehow due to the
>>> use of the GPU or CPU MPI in the allreduce?
>>>
>>
>> The VecNorm GPU rate is relatively high on Crusher and the CPU rate is
>> about the same as the other vec ops. I don't know what to make of that.
>>
>> But Crusher is clearly not crushing it.
>>
>> Junchao: Perhaps we should ask Kokkos if they have any experience with
>> Crusher that they can share. They could very well find some low level magic.
>>
>>
>>
>>>
>>>
>>> On Jan 24, 2022, at 12:14 PM, Mark Adams  wrote:
>>>
>>>
>>>
 Mark, can we compare with Spock?

>>>
>>>  Looks much better. This puts two processes/GPU because there are only 4.
>>> 
>>>
>>>
>>>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Junchao Zhang
Mark, I think you can benchmark individual vector operations, and once we
get reasonable profiling results, we can move to solvers etc.

--Junchao Zhang


On Mon, Jan 24, 2022 at 12:09 PM Mark Adams  wrote:

>
>
> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith  wrote:
>
>>
>>   Here except for VecNorm the GPU is used effectively in that most of the
>> time is time is spent doing real work on the GPU
>>
>> VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00
>> 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0 0.00e+000
>> 0.00e+00 100
>>
>> Even the dots are very effective, only the VecNorm flop rate over the
>> full time is much much lower than the vecdot. Which is somehow due to the
>> use of the GPU or CPU MPI in the allreduce?
>>
>
> The VecNorm GPU rate is relatively high on Crusher and the CPU rate is
> about the same as the other vec ops. I don't know what to make of that.
>
> But Crusher is clearly not crushing it.
>
> Junchao: Perhaps we should ask Kokkos if they have any experience with
> Crusher that they can share. They could very well find some low level magic.
>
>
>
>>
>>
>> On Jan 24, 2022, at 12:14 PM, Mark Adams  wrote:
>>
>>
>>
>>> Mark, can we compare with Spock?
>>>
>>
>>  Looks much better. This puts two processes/GPU because there are only 4.
>> 
>>
>>
>>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Mark Adams
On Mon, Jan 24, 2022 at 12:44 PM Barry Smith  wrote:

>
>   Here except for VecNorm the GPU is used effectively in that most of the
> time is time is spent doing real work on the GPU
>
> VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00
> 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0 0.00e+000
> 0.00e+00 100
>
> Even the dots are very effective, only the VecNorm flop rate over the full
> time is much much lower than the vecdot. Which is somehow due to the use of
> the GPU or CPU MPI in the allreduce?
>

The VecNorm GPU rate is relatively high on Crusher and the CPU rate is
about the same as the other vec ops. I don't know what to make of that.

But Crusher is clearly not crushing it.

Junchao: Perhaps we should ask Kokkos if they have any experience with
Crusher that they can share. They could very well find some low level magic.



>
>
> On Jan 24, 2022, at 12:14 PM, Mark Adams  wrote:
>
>
>
>> Mark, can we compare with Spock?
>>
>
>  Looks much better. This puts two processes/GPU because there are only 4.
> 
>
>
>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Barry Smith

  Here except for VecNorm the GPU is used effectively in that most of the time
is spent doing real work on the GPU

VecNorm  402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 
4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393  0 0.00e+000 
0.00e+00 100

Even the dots are very effective, only the VecNorm flop rate over the full time 
is much much lower than the vecdot. Which is somehow due to the use of the GPU 
or CPU MPI in the allreduce?
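
As a sanity check on that reading: 402 calls in about 4.4e-01 seconds with
roughly 1.69e+09 flop per rank, i.e. about 1.35e+10 flop over the 8 ranks, works
out to the ~30 Gflop/s end-to-end rate in the table, while the GPU column
reports ~225 Gflop/s. So the norm kernel itself runs roughly 7x faster than the
end-to-end rate, which points at the allreduce/synchronization rather than the
device kernel as where the VecNorm time goes.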



> On Jan 24, 2022, at 12:14 PM, Mark Adams  wrote:
> 
> 
> 
> Mark, can we compare with Spock?
> 
>  Looks much better. This puts two processes/GPU because there are only 4.
> 



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Mark Adams
>
> Mark, can we compare with Spock?
>

 Looks much better. This puts two processes/GPU because there are only 4.
DM Object: box 8 MPI processes
  type: plex
box in 3 dimensions:
  Number of 0-cells per rank: 274625 274625 274625 274625 274625 274625 274625 
274625
  Number of 1-cells per rank: 811200 811200 811200 811200 811200 811200 811200 
811200
  Number of 2-cells per rank: 798720 798720 798720 798720 798720 798720 798720 
798720
  Number of 3-cells per rank: 262144 262144 262144 262144 262144 262144 262144 
262144
Labels:
  celltype: 4 strata with value/size (0 (274625), 1 (811200), 4 (798720), 7 
(262144))
  depth: 4 strata with value/size (0 (274625), 1 (811200), 2 (798720), 3 
(262144))
  marker: 1 strata with value/size (1 (49530))
  Face Sets: 3 strata with value/size (1 (16129), 3 (16129), 6 (16129))
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=16581375, cols=16581375
total: nonzeros=1045678375, allocated nonzeros=1045678375
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=16581375, cols=16581375
total: nonzeros=1045678375, allocated nonzeros=1045678375
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=16581375, cols=16581375
total: nonzeros=1045678375, allocated nonzeros=1045678375
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
 
***
***WIDEN YOUR WINDOW TO 160 CHARACTERS.  Use 
'enscript -r -fCourier9' to print this document 
***


-- PETSc 
Performance Summary: 
---

/gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a 
arch-olcf-spock named spock02 with 8 processors, by adams Mon Jan 24 12:08:06 
2022
Using Petsc Development GIT revision: v3.16.3-684-g5e9ef69012  GIT Date: 
2022-01-23 14:51:57 -0800

 Max   Max/Min Avg   Total
Time (sec):   3.245e+02 1.000   3.245e+02
Objects:  1.990e+03 1.027   1.947e+03
Flop: 1.940e+11 1.027   1.915e+11  1.532e+12
Flop/sec: 5.978e+08 1.027   5.900e+08  4.720e+09
MPI Messages: 4.806e+03 1.066   4.571e+03  3.657e+04
MPI Message Lengths:  4.434e+08 1.015   9.611e+04  3.515e+09
MPI Reductions:   1.991e+03 1.000

Flop counting convention: 1 flop = 1 real number operation of type 
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N 
flop
and VecAXPY() for complex vectors of length N --> 
8N flop

Summary of Stages:   - Time --  - Flop --  --- Messages ---  -- 
Message Lengths --  -- Reductions --
Avg %Total Avg %TotalCount   %Total 
Avg %TotalCount   %Total
 0:  Main Stage: 3.2139e+02  99.0%  6.0875e+11  39.7%  1.417e+04  38.7%  
1.143e+05   46.1%  7.660e+02  38.5%
 1: PCSetUp: 1.5807e-01   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 2:  KSP Solve only: 2.9564e+00   0.9%  9.2287e+11  60.3%  2.240e+04  61.3%  

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Barry Smith
Not clear how to interpret, the "gpu" FLOP rate for dot and norm are a good 
amount higher (exact details of where the log functions are located can affect 
this) but the over flop rates of them are not much better. Scatter is better 
without GPU MPI. How much of this is noise, need to see statistics from 
multiple runs. Certainly not satisfying.
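
Concretely, comparing the two tables below (the first is the GPU-aware run; the
second, with its 800 CpuToGpu/GpuToCpu transfers, is presumably the run that
stages through the host): VecTDot drops from about 2.1 s to 1.4 s with GPU-aware
MPI, while VecScatterEnd goes the other way, from about 0.63 s up to 1.56 s, and
MatMult and VecNorm are essentially unchanged.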

GPU MPI

MatMult  400 1.0 8.4784e+00 1.1 1.06e+11 1.0 2.2e+04 8.5e+04 
0.0e+00  2 55 61 54  0  68 91100100  0 98667  139198  0 0.00e+000 
0.00e+00 100
KSPSolve   2 1.0 1.e+01 1.0 1.17e+11 1.0 2.2e+04 8.5e+04 
1.2e+03  3 60 61 54 60 100100100100100 75509  122610  0 0.00e+000 
0.00e+00 100
VecTDot  802 1.0 1.3863e+00 1.3 3.36e+09 1.0 0.0e+00 0.0e+00 
8.0e+02  0  2  0  0 40  10  3  0  0 67 19186   48762  0 0.00e+000 
0.00e+00 100
VecNorm  402 1.0 9.2933e-01 2.1 1.69e+09 1.0 0.0e+00 0.0e+00 
4.0e+02  0  1  0  0 20   6  1  0  0 33 14345  127332  0 0.00e+000 
0.00e+00 100
VecAXPY  800 1.0 8.2405e-01 1.0 3.36e+09 1.0 0.0e+00 0.0e+00 
0.0e+00  0  2  0  0  0   7  3  0  0  0 32195   62486  0 0.00e+000 
0.00e+00 100
VecAYPX  398 1.0 8.6891e-01 1.6 1.67e+09 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   6  1  0  0  0 15190   19019  0 0.00e+000 
0.00e+00 100
VecPointwiseMult 402 1.0 3.5227e-01 1.1 8.43e+08 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   3  1  0  0  0 18922   39878  0 0.00e+000 
0.00e+00 100
VecScatterBegin  400 1.0 1.1519e+00 2.1 0.00e+00 0.0 2.2e+04 8.5e+04 
0.0e+00  0  0 61 54  0   7  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd400 1.0 1.5642e+00 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  10  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0


MatMult  400 1.0 8.1754e+00 1.0 1.06e+11 1.0 2.2e+04 8.5e+04 
0.0e+00  2 55 61 54  0  65 91100100   102324  133771800 4.74e+02  800 
4.74e+02 100
KSPSolve   2 1.0 1.2605e+01 1.0 1.17e+11 1.0 2.2e+04 8.5e+04 
1.2e+03  2 60 61 54 60 100100100100100 73214  113908800 4.74e+02  800 
4.74e+02 100
VecTDot  802 1.0 2.0607e+00 1.2 3.36e+09 1.0 0.0e+00 0.0e+00 
8.0e+02  0  2  0  0 40  15  3  0  0 67 12907   25655  0 0.00e+000 
0.00e+00 100
VecNorm  402 1.0 9.5100e-01 2.1 1.69e+09 1.0 0.0e+00 0.0e+00 
4.0e+02  0  1  0  0 20   6  1  0  0 33 14018   96704  0 0.00e+000 
0.00e+00 100
VecAXPY  800 1.0 7.9864e-01 1.1 3.36e+09 1.0 0.0e+00 0.0e+00 
0.0e+00  0  2  0  0  0   6  3  0  0  0 33219   65843  0 0.00e+000 
0.00e+00 100
VecAYPX  398 1.0 8.0719e-01 1.7 1.67e+09 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   5  1  0  0  0 16352   21253  0 0.00e+000 
0.00e+00 100
VecPointwiseMult 402 1.0 3.7318e-01 1.1 8.43e+08 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   3  1  0  0  0 17862   38464  0 0.00e+000 
0.00e+00 100
VecScatterBegin  400 1.0 1.4075e+00 1.8 0.00e+00 0.0 2.2e+04 8.5e+04 
0.0e+00  0  0 61 54  0   9  0100100  0 0   0  0 0.00e+00  800 
4.74e+02  0
VecScatterEnd400 1.0 6.3044e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   5  0  0  0  0 0   0800 4.74e+020 
0.00e+00  0


> On Jan 24, 2022, at 10:25 AM, Mark Adams  wrote:
> 
>  
>   Mark,
> 
>  Can you run both with GPU aware MPI?
> 
> 
> Perlmuter fails with GPU aware MPI. I think there are know problems with this 
> that are being worked on.
> 
> And here is Crusher with GPU aware MPI.
>  
> 



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Mark Adams
>   Mark,
>
>  Can you run both with GPU aware MPI?
>
>
Perlmutter fails with GPU aware MPI. I think there are known problems with
this that are being worked on.

And here is Crusher with GPU aware MPI.
DM Object: box 8 MPI processes
  type: plex
box in 3 dimensions:
  Number of 0-cells per rank: 274625 274625 274625 274625 274625 274625 274625 
274625
  Number of 1-cells per rank: 811200 811200 811200 811200 811200 811200 811200 
811200
  Number of 2-cells per rank: 798720 798720 798720 798720 798720 798720 798720 
798720
  Number of 3-cells per rank: 262144 262144 262144 262144 262144 262144 262144 
262144
Labels:
  celltype: 4 strata with value/size (0 (274625), 1 (811200), 4 (798720), 7 
(262144))
  depth: 4 strata with value/size (0 (274625), 1 (811200), 2 (798720), 3 
(262144))
  marker: 1 strata with value/size (1 (49530))
  Face Sets: 3 strata with value/size (1 (16129), 3 (16129), 6 (16129))
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=16581375, cols=16581375
total: nonzeros=1045678375, allocated nonzeros=1045678375
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=16581375, cols=16581375
total: nonzeros=1045678375, allocated nonzeros=1045678375
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=16581375, cols=16581375
total: nonzeros=1045678375, allocated nonzeros=1045678375
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
 
***
***WIDEN YOUR WINDOW TO 160 CHARACTERS.  Use 
'enscript -r -fCourier9' to print this document 
***


-- PETSc 
Performance Summary: 
---

/gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a 
arch-olcf-crusher named crusher020 with 8 processors, by adams Mon Jan 24 
09:35:28 2022
Using Petsc Development GIT revision: v3.16.3-684-g5e9ef69012  GIT Date: 
2022-01-23 14:51:57 -0800

 Max   Max/Min Avg   Total
Time (sec):   3.756e+02 1.000   3.756e+02
Objects:  1.990e+03 1.027   1.947e+03
Flop: 1.940e+11 1.027   1.915e+11  1.532e+12
Flop/sec: 5.165e+08 1.027   5.098e+08  4.078e+09
MPI Messages: 4.806e+03 1.066   4.571e+03  3.657e+04
MPI Message Lengths:  4.434e+08 1.015   9.611e+04  3.515e+09
MPI Reductions:   1.991e+03 1.000

Flop counting convention: 1 flop = 1 real number operation of type 
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N 
flop
and VecAXPY() for complex vectors of length N --> 
8N flop

Summary of Stages:   - Time --  - Flop --  --- Messages ---  -- 
Message Lengths --  -- Reductions --
Avg %Total Avg %TotalCount   %Total 
Avg %TotalCount   %Total
 0:  Main Stage: 3.6338e+02  96.8%  6.0875e+11  39.7%  1.417e+04  38.7%  
1.143e+05   46.1%  7.660e+02  38.5%
 1: PCSetUp: 1.8507e-01   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Junchao Zhang
On Sun, Jan 23, 2022 at 11:22 PM Barry Smith  wrote:

>
>
> On Jan 24, 2022, at 12:16 AM, Junchao Zhang 
> wrote:
>
>
>
> On Sun, Jan 23, 2022 at 10:44 PM Barry Smith  wrote:
>
>>
>>   Junchao,
>>
>>  Without GPU aware MPI, is it moving the entire vector to the CPU and
>> doing the scatter and moving everything back or does it just move up
>> exactly what needs to be sent to the other ranks and move back exactly what
>> it received from other ranks?
>>
> It only moves entries needed, using a kernel to pack/unpack them.
>
>
> Ok, that pack kernel is Kokkos?  How come the pack times take so
> little time compared to the MPI sends in the locks those times are much
> smaller than the VecScatter times? Is the logging correct for how much
> stuff is sent up and down?
>
Yes, the pack/unpack kernels are kokkos.  I need to check the profiling.


>
>
>> It is moving 4.74e+02 * 1e+6 bytes total data up and then down. Is
>> that a reasonable amount?
>>
>> Why is it moving 800 distinct counts up and 800 distinct counts down
>> when the MatMult is done 400 times, shouldn't it be 400 counts?
>>
>>   Mark,
>>
>>  Can you run both with GPU aware MPI?
>>
>>
>>   Norm, AXPY, pointwisemult roughly the same.
>>
>>
>> On Jan 23, 2022, at 11:24 PM, Mark Adams  wrote:
>>
>> Ugh, try again. Still a big difference, but less.  Mat-vec does not
>> change much.
>>
>> On Sun, Jan 23, 2022 at 7:12 PM Barry Smith  wrote:
>>
>>>
>>>  You have debugging turned on on crusher but not permutter
>>>
>>> On Jan 23, 2022, at 6:37 PM, Mark Adams  wrote:
>>>
>>> * Perlmutter is roughly 5x faster than Crusher on the one node 2M eq
>>> test. (small)
>>> This is with 8 processes.
>>>
>>> * The next largest version of this test, 16M eq total and 8 processes,
>>> fails in memory allocation in the mat-mult setup in the Kokkos Mat.
>>>
>>> * If I try to run with 64 processes on Perlmutter I get this error in
>>> initialization. These nodes have 160 Gb of memory.
>>> (I assume this is related to these large memory requirements from
>>> loading packages, etc)
>>>
>>> Thanks,
>>> Mark
>>>
>>> + srun -n64 -N1 --cpu-bind=cores --ntasks-per-core=1 ../ex13
>>> -dm_plex_box_faces 4,4,4 -petscpartitioner_simple_process_grid 4,4,4
>>> -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1
>>> -dm_refine 6 -dm_view -pc_type jacobi -log
>>> _view -ksp_view -use_gpu_aware_mpi false -dm_mat_type aijkokkos
>>> -dm_vec_type kokkos -log_trace
>>> + tee jac_out_001_kokkos_Perlmutter_6_8.txt
>>> [48]PETSC ERROR: - Error Message
>>> --
>>> [48]PETSC ERROR: GPU error
>>> [48]PETSC ERROR: cuda error 2 (cudaErrorMemoryAllocation) : out of memory
>>> [48]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
>>> shooting.
>>> [48]PETSC ERROR: Petsc Development GIT revision: v3.16.3-683-gbc458ed4d8
>>>  GIT Date: 2022-01-22 12:18:02 -0600
>>> [48]PETSC ERROR: /global/u2/m/madams/petsc/src/snes/tests/data/../ex13
>>> on a arch-perlmutter-opt-gcc-kokkos-cuda named nid001424 by madams Sun Jan
>>> 23 15:19:56 2022
>>> [48]PETSC ERROR: Configure options --CFLAGS="   -g -DLANDAU_DIM=2
>>> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CXXFLAGS=" -g -DLANDAU_DIM=2
>>> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CUDAFLAGS="-g -Xcompiler
>>> -rdynamic -DLANDAU_DIM=2 -DLAN
>>> DAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --with-cc=cc --with-cxx=CC
>>> --with-fc=ftn --LDFLAGS=-lmpifort_gnu_91
>>> --with-cudac=/global/common/software/nersc/cos1.3/cuda/11.3.0/bin/nvcc
>>> --COPTFLAGS="   -O3" --CXXOPTFLAGS=" -O3" --FOPTFLAGS="   -O3"
>>>  --with-debugging=0 --download-metis --download-parmetis --with-cuda=1
>>> --with-cuda-arch=80 --with-mpiexec=srun --with-batch=0 --download-p4est=1
>>> --with-zlib=1 --download-kokkos --download-kokkos-kernels
>>> --with-kokkos-kernels-tpl=0 --with-
>>> make-np=8 PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda
>>> [48]PETSC ERROR: #1 initialize() at
>>> /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:72
>>> [48]PETSC ERROR: #2 initialize() at
>>> /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:343
>>> [48]PETSC ERROR: #3 PetscDeviceInitializeTypeFromOptions_Private() at
>>> /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:319
>>> [48]PETSC ERROR: #4 PetscDeviceInitializeFromOptions_Internal() at
>>> /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:449
>>> [48]PETSC ERROR: #5 PetscInitialize_Common() at
>>> /global/u2/m/madams/petsc/src/sys/objects/pinit.c:963
>>> [48]PETSC ERROR: #6 PetscInitialize() at
>>> /global/u2/m/madams/petsc/src/sys/objects/pinit.c:1238
>>>
>>>
>>> On Sun, Jan 23, 2022 at 8:58 AM Mark Adams  wrote:
>>>


 On Sat, Jan 22, 2022 at 6:22 PM Barry Smith  wrote:

>
>I cleaned up Mark's last run and put it in a fixed-width font. I
> realize this may be too difficult but it would be great to have identical

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Jed Brown
Barry Smith  writes:

>   We should make it easy to turn off the logging and synchronizations (from 
> PetscLogGpu) for everything Vec and below, and everything Mat and below to 
> remove all the synchronizations needed for the low level timing. I think we 
> can do that by having  PetscLogGpu take a PETSc class id argument.

Or take the PetscObject as an argument, from which we can get the class ID, and 
if we ever want to allow per-object customization, won't have to revisit these 
interfaces.


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Barry Smith


> On Jan 24, 2022, at 12:16 AM, Junchao Zhang  wrote:
> 
> 
> 
> On Sun, Jan 23, 2022 at 10:44 PM Barry Smith  > wrote:
> 
>   Junchao,
> 
>  Without GPU aware MPI, is it moving the entire vector to the CPU and 
> doing the scatter and moving everything back or does it just move up exactly 
> what needs to be sent to the other ranks and move back exactly what it 
> received from other ranks?
> It only moves entries needed, using a kernel to pack/unpack them.

    Ok, that pack kernel is Kokkos?  How come the pack times take so little
time compared to the MPI sends? In the logs those times are much smaller than
the VecScatter times. Is the logging correct for how much stuff is sent up and
down?

> 
> It is moving 4.74e+02 * 1e+6 bytes total data up and then down. Is that a 
> reasonable amount?
> 
> Why is it moving 800 distinct counts up and 800 distinct counts down when 
> the MatMult is done 400 times, shouldn't it be 400 counts?
> 
>   Mark,
> 
>  Can you run both with GPU aware MPI?
> 
>
>   Norm, AXPY, pointwisemult roughly the same.
> 
> 
>> On Jan 23, 2022, at 11:24 PM, Mark Adams > > wrote:
>> 
>> Ugh, try again. Still a big difference, but less.  Mat-vec does not change 
>> much.
>> 
>> On Sun, Jan 23, 2022 at 7:12 PM Barry Smith > > wrote:
>> 
>>  You have debugging turned on on crusher but not permutter
>> 
>>> On Jan 23, 2022, at 6:37 PM, Mark Adams >> > wrote:
>>> 
>>> * Perlmutter is roughly 5x faster than Crusher on the one node 2M eq test. 
>>> (small)
>>> This is with 8 processes. 
>>> 
>>> * The next largest version of this test, 16M eq total and 8 processes, 
>>> fails in memory allocation in the mat-mult setup in the Kokkos Mat.
>>> 
>>> * If I try to run with 64 processes on Perlmutter I get this error in 
>>> initialization. These nodes have 160 Gb of memory.
>>> (I assume this is related to these large memory requirements from loading 
>>> packages, etc)
>>> 
>>> Thanks,
>>> Mark
>>> 
>>> + srun -n64 -N1 --cpu-bind=cores --ntasks-per-core=1 ../ex13 
>>> -dm_plex_box_faces 4,4,4 -petscpartitioner_simple_process_grid 4,4,4 
>>> -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1 
>>> -dm_refine 6 -dm_view -pc_type jacobi -log
>>> _view -ksp_view -use_gpu_aware_mpi false -dm_mat_type aijkokkos 
>>> -dm_vec_type kokkos -log_trace
>>> + tee jac_out_001_kokkos_Perlmutter_6_8.txt
>>> [48]PETSC ERROR: - Error Message 
>>> --
>>> [48]PETSC ERROR: GPU error 
>>> [48]PETSC ERROR: cuda error 2 (cudaErrorMemoryAllocation) : out of memory
>>> [48]PETSC ERROR: See https://petsc.org/release/faq/ 
>>>  for trouble shooting.
>>> [48]PETSC ERROR: Petsc Development GIT revision: v3.16.3-683-gbc458ed4d8  
>>> GIT Date: 2022-01-22 12:18:02 -0600
>>> [48]PETSC ERROR: /global/u2/m/madams/petsc/src/snes/tests/data/../ex13 on a 
>>> arch-perlmutter-opt-gcc-kokkos-cuda named nid001424 by madams Sun Jan 23 
>>> 15:19:56 2022
>>> [48]PETSC ERROR: Configure options --CFLAGS="   -g -DLANDAU_DIM=2 
>>> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CXXFLAGS=" -g -DLANDAU_DIM=2 
>>> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CUDAFLAGS="-g -Xcompiler 
>>> -rdynamic -DLANDAU_DIM=2 -DLAN
>>> DAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --with-cc=cc --with-cxx=CC 
>>> --with-fc=ftn --LDFLAGS=-lmpifort_gnu_91 
>>> --with-cudac=/global/common/software/nersc/cos1.3/cuda/11.3.0/bin/nvcc 
>>> --COPTFLAGS="   -O3" --CXXOPTFLAGS=" -O3" --FOPTFLAGS="   -O3"
>>>  --with-debugging=0 --download-metis --download-parmetis --with-cuda=1 
>>> --with-cuda-arch=80 --with-mpiexec=srun --with-batch=0 --download-p4est=1 
>>> --with-zlib=1 --download-kokkos --download-kokkos-kernels 
>>> --with-kokkos-kernels-tpl=0 --with-
>>> make-np=8 PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda
>>> [48]PETSC ERROR: #1 initialize() at 
>>> /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:72
>>> [48]PETSC ERROR: #2 initialize() at 
>>> /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:343
>>> [48]PETSC ERROR: #3 PetscDeviceInitializeTypeFromOptions_Private() at 
>>> /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:319
>>> [48]PETSC ERROR: #4 PetscDeviceInitializeFromOptions_Internal() at 
>>> /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:449
>>> [48]PETSC ERROR: #5 PetscInitialize_Common() at 
>>> /global/u2/m/madams/petsc/src/sys/objects/pinit.c:963
>>> [48]PETSC ERROR: #6 PetscInitialize() at 
>>> /global/u2/m/madams/petsc/src/sys/objects/pinit.c:1238
>>> 
>>> 
>>> On Sun, Jan 23, 2022 at 8:58 AM Mark Adams >> > wrote:
>>> 
>>> 
>>> On Sat, Jan 22, 2022 at 6:22 PM Barry Smith >> > wrote:
>>> 
>>>I cleaned up Mark's last run and put it in a fixed-width font. I realize 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Junchao Zhang
On Sun, Jan 23, 2022 at 10:44 PM Barry Smith  wrote:

>
>   Junchao,
>
>  Without GPU aware MPI, is it moving the entire vector to the CPU and
> doing the scatter and moving everything back or does it just move up
> exactly what needs to be sent to the other ranks and move back exactly what
> it received from other ranks?
>
It only moves entries needed, using a kernel to pack/unpack them.

>
> It is moving 4.74e+02 * 1e+6 bytes total data up and then down. Is
> that a reasonable amount?
>
> Why is it moving 800 distinct counts up and 800 distinct counts down
> when the MatMult is done 400 times, shouldn't it be 400 counts?
>
>   Mark,
>
>  Can you run both with GPU aware MPI?
>
>
>   Norm, AXPY, pointwisemult roughly the same.
>
>
> On Jan 23, 2022, at 11:24 PM, Mark Adams  wrote:
>
> Ugh, try again. Still a big difference, but less.  Mat-vec does not change
> much.
>
> On Sun, Jan 23, 2022 at 7:12 PM Barry Smith  wrote:
>
>>
>>  You have debugging turned on on crusher but not permutter
>>
>> On Jan 23, 2022, at 6:37 PM, Mark Adams  wrote:
>>
>> * Perlmutter is roughly 5x faster than Crusher on the one node 2M eq
>> test. (small)
>> This is with 8 processes.
>>
>> * The next largest version of this test, 16M eq total and 8 processes,
>> fails in memory allocation in the mat-mult setup in the Kokkos Mat.
>>
>> * If I try to run with 64 processes on Perlmutter I get this error in
>> initialization. These nodes have 160 Gb of memory.
>> (I assume this is related to these large memory requirements from loading
>> packages, etc)
>>
>> Thanks,
>> Mark
>>
>> + srun -n64 -N1 --cpu-bind=cores --ntasks-per-core=1 ../ex13
>> -dm_plex_box_faces 4,4,4 -petscpartitioner_simple_process_grid 4,4,4
>> -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1
>> -dm_refine 6 -dm_view -pc_type jacobi -log
>> _view -ksp_view -use_gpu_aware_mpi false -dm_mat_type aijkokkos
>> -dm_vec_type kokkos -log_trace
>> + tee jac_out_001_kokkos_Perlmutter_6_8.txt
>> [48]PETSC ERROR: - Error Message
>> --
>> [48]PETSC ERROR: GPU error
>> [48]PETSC ERROR: cuda error 2 (cudaErrorMemoryAllocation) : out of memory
>> [48]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
>> [48]PETSC ERROR: Petsc Development GIT revision: v3.16.3-683-gbc458ed4d8
>>  GIT Date: 2022-01-22 12:18:02 -0600
>> [48]PETSC ERROR: /global/u2/m/madams/petsc/src/snes/tests/data/../ex13 on
>> a arch-perlmutter-opt-gcc-kokkos-cuda named nid001424 by madams Sun Jan 23
>> 15:19:56 2022
>> [48]PETSC ERROR: Configure options --CFLAGS="   -g -DLANDAU_DIM=2
>> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CXXFLAGS=" -g -DLANDAU_DIM=2
>> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CUDAFLAGS="-g -Xcompiler
>> -rdynamic -DLANDAU_DIM=2 -DLAN
>> DAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --with-cc=cc --with-cxx=CC
>> --with-fc=ftn --LDFLAGS=-lmpifort_gnu_91
>> --with-cudac=/global/common/software/nersc/cos1.3/cuda/11.3.0/bin/nvcc
>> --COPTFLAGS="   -O3" --CXXOPTFLAGS=" -O3" --FOPTFLAGS="   -O3"
>>  --with-debugging=0 --download-metis --download-parmetis --with-cuda=1
>> --with-cuda-arch=80 --with-mpiexec=srun --with-batch=0 --download-p4est=1
>> --with-zlib=1 --download-kokkos --download-kokkos-kernels
>> --with-kokkos-kernels-tpl=0 --with-
>> make-np=8 PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda
>> [48]PETSC ERROR: #1 initialize() at
>> /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:72
>> [48]PETSC ERROR: #2 initialize() at
>> /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:343
>> [48]PETSC ERROR: #3 PetscDeviceInitializeTypeFromOptions_Private() at
>> /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:319
>> [48]PETSC ERROR: #4 PetscDeviceInitializeFromOptions_Internal() at
>> /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:449
>> [48]PETSC ERROR: #5 PetscInitialize_Common() at
>> /global/u2/m/madams/petsc/src/sys/objects/pinit.c:963
>> [48]PETSC ERROR: #6 PetscInitialize() at
>> /global/u2/m/madams/petsc/src/sys/objects/pinit.c:1238
>>
>>
>> On Sun, Jan 23, 2022 at 8:58 AM Mark Adams  wrote:
>>
>>>
>>>
>>> On Sat, Jan 22, 2022 at 6:22 PM Barry Smith  wrote:
>>>

I cleaned up Mark's last run and put it in a fixed-width font. I
 realize this may be too difficult but it would be great to have identical
 runs to compare with on Summit.

>>>
>>> I was planning on running this on Perlmutter today, as well as some
>>> sanity checks like all GPUs are being used. I'll try PetscDeviceView.
>>>
>>> Junchao modified the timers and all GPU > CPU now, but he seemed to move
>>> the timers more outside and Barry wants them tight on the "kernel".
>>> I think Junchao is going to work on that so I will hold off.
>>> (I removed the the Kokkos wait stuff and seemed to run a little faster
>>> but I am not sure how deterministic the timers are, and I did a test 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Barry Smith



> On Jan 23, 2022, at 11:47 PM, Jed Brown  wrote:
> 
> Barry Smith via petsc-dev  writes:
> 
>>  The PetscLogGpuTimeBegin()/End was written by Hong so it works with events 
>> to get a GPU timing, it is not suppose to include the CPU kernel launch 
>> times or the time to move the scalar arguments to the GPU. It may not be 
>> perfect but it is the best we can do to capture the time the GPU is actively 
>> doing the numerics, which is what we want.
> 
> As we discussed at the time, collecting the results can be asynchronous and 
> this would be useful to reduce the negative impact of profiling on end-to-end 
> performance.
> 
> But I think what's proposed here is okay because PetscLogGpuTimeBegin() 
> starts counting when the device reaches that point, not when it's given on 
> the host.

  This is how it is supposed to work.

  We should make it easy to turn off the logging and synchronizations (from 
PetscLogGpu) for everything Vec and below, and everything Mat and below to 
remove all the synchronizations needed for the low level timing. I think we can 
do that by having  PetscLogGpu take a PETSc class id argument.
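
  A rough sketch of what that could look like (hypothetical: PetscLogGpuTimeBeginClass() 
and the per-class switches below are not existing PETSc API; only PetscLogGpuTimeBegin() 
is a real call):

```c
#include <petscvec.h>
#include <petscmat.h>

/* Hypothetical per-class switches; a real version would presumably live in the
   logging state and be set from a command-line option. */
static PetscBool LogGpuTimeVec = PETSC_TRUE;
static PetscBool LogGpuTimeMat = PETSC_TRUE;

static inline PetscErrorCode PetscLogGpuTimeBeginClass(PetscClassId classid)
{
  PetscBool active = PETSC_TRUE;

  if (classid == VEC_CLASSID)      active = LogGpuTimeVec;
  else if (classid == MAT_CLASSID) active = LogGpuTimeMat;
  if (active) return PetscLogGpuTimeBegin(); /* real call: the event-based GPU timer */
  return 0;                                  /* skip the timer and its synchronization */
}
```

A Vec kernel would then call PetscLogGpuTimeBeginClass(VEC_CLASSID) instead of 
PetscLogGpuTimeBegin(), and switching the Vec entry off removes both the timing and the 
synchronization it implies.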



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Jed Brown
Barry Smith  writes:

>   Norm, AXPY, pointwisemult roughly the same.

These are where I think we need to start. The bandwidth they are achieving is 
supposed to be possible with just one chiplet.

Mark, can we compare with Spock?


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Jed Brown
Barry Smith via petsc-dev  writes:

>   The PetscLogGpuTimeBegin()/End was written by Hong so it works with events 
> to get a GPU timing, it is not supposed to include the CPU kernel launch times 
> or the time to move the scalar arguments to the GPU. It may not be perfect 
> but it is the best we can do to capture the time the GPU is actively doing 
> the numerics, which is what we want.

As we discussed at the time, collecting the results can be asynchronous and 
this would be useful to reduce the negative impact of profiling on end-to-end 
performance.

But I think what's proposed here is okay because PetscLogGpuTimeBegin() starts 
counting when the device reaches that point, not when it's given on the host.


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Barry Smith

  Junchao,

 Without GPU aware MPI, is it moving the entire vector to the CPU and doing 
the scatter and moving everything back or does it just move up exactly what 
needs to be sent to the other ranks and move back exactly what it received from 
other ranks?

It is moving 4.74e+02 * 1e+6 bytes total data up and then down. Is that a 
reasonable amount?

Why is it moving 800 distinct counts up and 800 distinct counts down when 
the MatMult is done 400 times, shouldn't it be 400 counts?

  Mark,

 Can you run both with GPU aware MPI?

   
  Norm, AXPY, pointwisemult roughly the same.


> On Jan 23, 2022, at 11:24 PM, Mark Adams  wrote:
> 
> Ugh, try again. Still a big difference, but less.  Mat-vec does not change 
> much.
> 
> On Sun, Jan 23, 2022 at 7:12 PM Barry Smith  > wrote:
> 
>  You have debugging turned on on Crusher but not Perlmutter
> 
>> On Jan 23, 2022, at 6:37 PM, Mark Adams > > wrote:
>> 
>> * Perlmutter is roughly 5x faster than Crusher on the one node 2M eq test. 
>> (small)
>> This is with 8 processes. 
>> 
>> * The next largest version of this test, 16M eq total and 8 processes, fails 
>> in memory allocation in the mat-mult setup in the Kokkos Mat.
>> 
>> * If I try to run with 64 processes on Perlmutter I get this error in 
>> initialization. These nodes have 160 GB of memory.
>> (I assume this is related to these large memory requirements from loading 
>> packages, etc)
>> 
>> Thanks,
>> Mark
>> 
>> + srun -n64 -N1 --cpu-bind=cores --ntasks-per-core=1 ../ex13 
>> -dm_plex_box_faces 4,4,4 -petscpartitioner_simple_process_grid 4,4,4 
>> -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1 -dm_refine 
>> 6 -dm_view -pc_type jacobi -log
>> _view -ksp_view -use_gpu_aware_mpi false -dm_mat_type aijkokkos -dm_vec_type 
>> kokkos -log_trace
>> + tee jac_out_001_kokkos_Perlmutter_6_8.txt
>> [48]PETSC ERROR: - Error Message 
>> --
>> [48]PETSC ERROR: GPU error 
>> [48]PETSC ERROR: cuda error 2 (cudaErrorMemoryAllocation) : out of memory
>> [48]PETSC ERROR: See https://petsc.org/release/faq/ 
>>  for trouble shooting.
>> [48]PETSC ERROR: Petsc Development GIT revision: v3.16.3-683-gbc458ed4d8  
>> GIT Date: 2022-01-22 12:18:02 -0600
>> [48]PETSC ERROR: /global/u2/m/madams/petsc/src/snes/tests/data/../ex13 on a 
>> arch-perlmutter-opt-gcc-kokkos-cuda named nid001424 by madams Sun Jan 23 
>> 15:19:56 2022
>> [48]PETSC ERROR: Configure options --CFLAGS="   -g -DLANDAU_DIM=2 
>> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CXXFLAGS=" -g -DLANDAU_DIM=2 
>> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CUDAFLAGS="-g -Xcompiler 
>> -rdynamic -DLANDAU_DIM=2 -DLAN
>> DAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --with-cc=cc --with-cxx=CC 
>> --with-fc=ftn --LDFLAGS=-lmpifort_gnu_91 
>> --with-cudac=/global/common/software/nersc/cos1.3/cuda/11.3.0/bin/nvcc 
>> --COPTFLAGS="   -O3" --CXXOPTFLAGS=" -O3" --FOPTFLAGS="   -O3"
>>  --with-debugging=0 --download-metis --download-parmetis --with-cuda=1 
>> --with-cuda-arch=80 --with-mpiexec=srun --with-batch=0 --download-p4est=1 
>> --with-zlib=1 --download-kokkos --download-kokkos-kernels 
>> --with-kokkos-kernels-tpl=0 --with-
>> make-np=8 PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda
>> [48]PETSC ERROR: #1 initialize() at 
>> /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:72
>> [48]PETSC ERROR: #2 initialize() at 
>> /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:343
>> [48]PETSC ERROR: #3 PetscDeviceInitializeTypeFromOptions_Private() at 
>> /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:319
>> [48]PETSC ERROR: #4 PetscDeviceInitializeFromOptions_Internal() at 
>> /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:449
>> [48]PETSC ERROR: #5 PetscInitialize_Common() at 
>> /global/u2/m/madams/petsc/src/sys/objects/pinit.c:963
>> [48]PETSC ERROR: #6 PetscInitialize() at 
>> /global/u2/m/madams/petsc/src/sys/objects/pinit.c:1238
>> 
>> 
>> On Sun, Jan 23, 2022 at 8:58 AM Mark Adams > > wrote:
>> 
>> 
>> On Sat, Jan 22, 2022 at 6:22 PM Barry Smith > > wrote:
>> 
>>I cleaned up Mark's last run and put it in a fixed-width font. I realize 
>> this may be too difficult but it would be great to have identical runs to 
>> compare with on Summit.
>> 
>> I was planning on running this on Perlmutter today, as well as some sanity 
>> checks like all GPUs are being used. I'll try PetscDeviceView.
>> 
>> Junchao modified the timers and all GPU > CPU now, but he seemed to move the 
>> timers more outside and Barry wants them tight on the "kernel".
>> I think Junchao is going to work on that so I will hold off.
>> (I removed the Kokkos wait stuff and it seemed to run a little faster but I 
>> am not sure how deterministic the timers are, and I did 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Mark Adams
Ugh, try again. Still a big difference, but less.  Mat-vec does not change
much.

On Sun, Jan 23, 2022 at 7:12 PM Barry Smith  wrote:

>
>  You have debugging turned on on Crusher but not Perlmutter
>
> On Jan 23, 2022, at 6:37 PM, Mark Adams  wrote:
>
> * Perlmutter is roughly 5x faster than Crusher on the one node 2M eq test.
> (small)
> This is with 8 processes.
>
> * The next largest version of this test, 16M eq total and 8 processes,
> fails in memory allocation in the mat-mult setup in the Kokkos Mat.
>
> * If I try to run with 64 processes on Perlmutter I get this error in
> initialization. These nodes have 160 GB of memory.
> (I assume this is related to these large memory requirements from loading
> packages, etc)
>
> Thanks,
> Mark
>
> + srun -n64 -N1 --cpu-bind=cores --ntasks-per-core=1 ../ex13
> -dm_plex_box_faces 4,4,4 -petscpartitioner_simple_process_grid 4,4,4
> -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1
> -dm_refine 6 -dm_view -pc_type jacobi -log
> _view -ksp_view -use_gpu_aware_mpi false -dm_mat_type aijkokkos
> -dm_vec_type kokkos -log_trace
> + tee jac_out_001_kokkos_Perlmutter_6_8.txt
> [48]PETSC ERROR: - Error Message
> --
> [48]PETSC ERROR: GPU error
> [48]PETSC ERROR: cuda error 2 (cudaErrorMemoryAllocation) : out of memory
> [48]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
> [48]PETSC ERROR: Petsc Development GIT revision: v3.16.3-683-gbc458ed4d8
>  GIT Date: 2022-01-22 12:18:02 -0600
> [48]PETSC ERROR: /global/u2/m/madams/petsc/src/snes/tests/data/../ex13 on
> a arch-perlmutter-opt-gcc-kokkos-cuda named nid001424 by madams Sun Jan 23
> 15:19:56 2022
> [48]PETSC ERROR: Configure options --CFLAGS="   -g -DLANDAU_DIM=2
> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CXXFLAGS=" -g -DLANDAU_DIM=2
> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CUDAFLAGS="-g -Xcompiler
> -rdynamic -DLANDAU_DIM=2 -DLAN
> DAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --with-cc=cc --with-cxx=CC
> --with-fc=ftn --LDFLAGS=-lmpifort_gnu_91
> --with-cudac=/global/common/software/nersc/cos1.3/cuda/11.3.0/bin/nvcc
> --COPTFLAGS="   -O3" --CXXOPTFLAGS=" -O3" --FOPTFLAGS="   -O3"
>  --with-debugging=0 --download-metis --download-parmetis --with-cuda=1
> --with-cuda-arch=80 --with-mpiexec=srun --with-batch=0 --download-p4est=1
> --with-zlib=1 --download-kokkos --download-kokkos-kernels
> --with-kokkos-kernels-tpl=0 --with-
> make-np=8 PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda
> [48]PETSC ERROR: #1 initialize() at
> /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:72
> [48]PETSC ERROR: #2 initialize() at
> /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:343
> [48]PETSC ERROR: #3 PetscDeviceInitializeTypeFromOptions_Private() at
> /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:319
> [48]PETSC ERROR: #4 PetscDeviceInitializeFromOptions_Internal() at
> /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:449
> [48]PETSC ERROR: #5 PetscInitialize_Common() at
> /global/u2/m/madams/petsc/src/sys/objects/pinit.c:963
> [48]PETSC ERROR: #6 PetscInitialize() at
> /global/u2/m/madams/petsc/src/sys/objects/pinit.c:1238
>
>
> On Sun, Jan 23, 2022 at 8:58 AM Mark Adams  wrote:
>
>>
>>
>> On Sat, Jan 22, 2022 at 6:22 PM Barry Smith  wrote:
>>
>>>
>>>I cleaned up Mark's last run and put it in a fixed-width font. I
>>> realize this may be too difficult but it would be great to have identical
>>> runs to compare with on Summit.
>>>
>>
>> I was planning on running this on Perlmutter today, as well as some
>> sanity checks like all GPUs are being used. I'll try PetscDeviceView.
>>
>> Junchao modified the timers and all GPU > CPU now, but he seemed to move
>> the timers more outside and Barry wants them tight on the "kernel".
>> I think Junchao is going to work on that so I will hold off.
>> (I removed the Kokkos wait stuff and it seemed to run a little faster
>> but I am not sure how deterministic the timers are, and I did a test with
>> GAMG and it was fine.)
>>
>>
>>>
>>>As Jed noted Scatter takes a long time but the pack and unpack take
>>> no time? Is this not timed if using Kokkos?
>>>
>>>
>>> --- Event Stage 2: KSP Solve only
>>>
>>> MatMult  400 1.0 8.8003e+00 1.1 1.06e+11 1.0 2.2e+04 8.5e+04
>>> 0.0e+00  2 55 61 54  0  70 91100100   95,058   132,242  0 0.00e+000
>>> 0.00e+00 100
>>> VecScatterBegin  400 1.0 1.3391e+00 2.6 0.00e+00 0.0 2.2e+04 8.5e+04
>>> 0.0e+00  0  0 61 54  0   7  01001000 0  0 0.00e+000
>>> 0.00e+00  0
>>> VecScatterEnd400 1.0 1.3240e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   9  0  0  00 0  0 0.00e+000
>>> 0.00e+00  0
>>> SFPack   400 1.0 1.8276e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  00 0  0 0.00e+000
>>> 0.00e+00  

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Barry Smith via petsc-dev


> On Jan 23, 2022, at 10:47 PM, Jacob Faibussowitsch  
> wrote:
> 
>> The outer LogEventBegin/End captures the entire time, including copies, 
>> kernel launches etc.
> 
> Not if the GPU call is asynchronous. To time the call the stream must also be 
> synchronized with the host. The only way to truly time only the kernel calls 
> themselves is to wrap the actual call itself:
> 
> ```
> cublasXaxpy_petsc(…)
> {
>   PetscLogGpuTimeBegin();
>   cublasXaxpy(…);
>   PetscLogGpuTimeEnd();
> }
> ```

  Indeed, they are wrapped as above.

> 
> Note that
> 
> ```
> #define cublasXaxpy_petsc(…) 
> PetscLogGpuTimeBegin();cublasXaxpy(…);PetscLogGpuTimeEnd();
> ```
> 
> Is not sufficient, as this would still include transfers if those transfers 
> happen as direct arguments to the function:
> 
> ```
> cublasXaxpy_petsc(RAII_xfer_to_device(),…);


  I am not sure what you mean here? RAII_xfer_to_device()?  Do you mean unified 
memory transfers down? I don't think we use those.

  The PetscLogGpuTimeBegin()/End was written by Hong so it works with events to 
get a GPU timing, it is not supposed to include the CPU kernel launch times or 
the time to move the scalar arguments to the GPU. It may not be perfect but it 
is the best we can do to capture the time the GPU is actively doing the 
numerics, which is what we want.


> ```
> 
> Best regards,
> 
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
> 
>> On Jan 23, 2022, at 21:37, Barry Smith > > wrote:
>> 
>> 
>> 
>>> On Jan 23, 2022, at 10:01 PM, Junchao Zhang >> > wrote:
>>> 
>>> 
>>> 
>>> On Sat, Jan 22, 2022 at 9:00 PM Junchao Zhang >> > wrote:
>>> 
>>> 
>>> 
>>> On Sat, Jan 22, 2022 at 5:00 PM Barry Smith >> > wrote:
>>> 
>>>   The GPU flop rate (when 100 percent flops on the GPU) should always be 
>>> higher than the overall flop rate (the previous column). For large problems 
>>> they should be similar, for small problems the GPU one may be much higher.
>>> 
>>>   If the CPU one is higher (when 100 percent flops on the GPU) something 
>>> must be wrong with the logging. I looked at the code for the two cases and 
>>> didn't see anything obvious.
>>> 
>>>   Junchao and Jacob,
>>>   I think some of the timing code in the Kokkos interface is wrong. 
>>> 
>>> *  The PetscLogGpuTimeBegin/End should be inside the viewer access code 
>>> not outside it. (The GPU time is an attempt to best time the kernels, not 
>>> other processing around the use of the kernels, that other stuff is 
>>> captured in the general LogEventBegin/End.
>>> What about potential host to device memory copy before calling a kernel?  
>>> Should we count it in the kernel time?
>> 
>>   Nope, absolutely not. The GPU time represents the time the GPU is doing 
>> active work. The outer LogEventBegin/End captures the entire time, including 
>> copies, kernel launches etc. No reason to put the copy time in the GPU time 
>> because then there would be no need for the GPU since it would be the 
>> LogEventBegin/End. The LogEventBegin/End minus the GPU time represents any 
>> overhead from transfers.
>> 
>> 
>>> 
>>> Good point 
>>> *  The use of WaitForKokkos() is confusing and seems inconsistent. 
>>> I need to have a look. Until now, I have not paid much attention to kokkos 
>>> profiling.
>>>  -For example it is used in VecTDot_SeqKokkos() which I would 
>>> think has a barrier anyways because it puts a scalar result into update? 
>>>  -Plus PetscLogGpuTimeBegin/End is supposed to already have a 
>>> suitable system (that Hong added) to ensure the kernel is complete; reading 
>>> the manual page and looking at Jacob's cupmcontext.hpp it seems to be there 
>>> so I don't think WaitForKokkos() is needed in most places (or is Kokkos 
>>> asynchronous and needs this for correctness?) 
>>> But these won't explain the strange result of overall flop rate being 
>>> higher than GPU flop rate.
>>> 
>>>   Barry
>>> 
>>> 
>>> 
>>> 
>>> 
 On Jan 22, 2022, at 11:44 AM, Mark Adams >>> > wrote:
 
 I am getting some funny timings and I'm trying to figure it out.  
 I figure the GPU flop rates are a bit higher because the timers are inside 
 of the CPU timers, but some are a lot bigger or inverted 
 
 --- Event Stage 2: KSP Solve only
 
 MatMult  400 1.0 1.0094e+01 1.2 1.07e+11 1.0 3.7e+05 6.1e+04 
 0.0e+00  2 55 62 54  0  68 91100100  0 671849   857147  0 0.00e+00
 0 0.00e+00 100
 MatView2 1.0 4.5257e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 
 2.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
 0.00e+00  0
 KSPSolve   2 1.0 1.4591e+01 1.1 1.18e+11 1.0 3.7e+05 6.1e+04 
 1.2e+03  2 60 62 54 60 100100100100100 512399   804048  0 0.00e+00
 0 0.00e+00 100
 SFPack   400 1.0 2.4545e-03 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Jacob Faibussowitsch
> The outer LogEventBegin/End captures the entire time, including copies, 
> kernel launches etc.

Not if the GPU call is asynchronous. To time the call the stream must also be 
synchronized with the host. The only way to truly time only the kernel calls 
themselves is to wrap the actual call itself:

```
cublasXaxpy_petsc(…)
{
  PetscLogGpuTimeBegin();
  cublasXaxpy(…);
  PetscLogGpuTimeEnd();
}
```

Note that

```
#define cublasXaxpy_petsc(…) 
PetscLogGpuTimeBegin();cublasXaxpy(…);PetscLogGpuTimeEnd();
```

Is not sufficient, as this would still include transfers if those transfers 
happen as direct arguments to the function:

```
cublasXaxpy_petsc(RAII_xfer_to_device(),…);
```
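
For reference, here is a self-contained sketch of the event-based pattern in plain CUDA 
(TimeKernelOnStream and the launch callback are made up for illustration; this is not the 
PETSc implementation). The events are recorded on the same stream as the kernel, so the 
measured interval starts when the device reaches that point rather than when the host 
issues the launch, and only the final synchronize blocks the host:

```c
#include <cuda_runtime.h>

/* Time a kernel launched on `stream` using CUDA events (sketch only). */
static float TimeKernelOnStream(cudaStream_t stream, void (*launch)(cudaStream_t))
{
  cudaEvent_t start, stop;
  float       ms = 0;

  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start, stream);  /* marks when the stream reaches "Begin" */
  launch(stream);                  /* asynchronous kernel launch */
  cudaEventRecord(stop, stream);   /* marks when the kernel has finished on the stream */
  cudaEventSynchronize(stop);      /* the only host-side wait, needed to read the timer */
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}
```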

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

> On Jan 23, 2022, at 21:37, Barry Smith  wrote:
> 
> 
> 
>> On Jan 23, 2022, at 10:01 PM, Junchao Zhang > > wrote:
>> 
>> 
>> 
>> On Sat, Jan 22, 2022 at 9:00 PM Junchao Zhang > > wrote:
>> 
>> 
>> 
>> On Sat, Jan 22, 2022 at 5:00 PM Barry Smith > > wrote:
>> 
>>   The GPU flop rate (when 100 percent flops on the GPU) should always be 
>> higher than the overall flop rate (the previous column). For large problems 
>> they should be similar, for small problems the GPU one may be much higher.
>> 
>>   If the CPU one is higher (when 100 percent flops on the GPU) something 
>> must be wrong with the logging. I looked at the code for the two cases and 
>> didn't see anything obvious.
>> 
>>   Junchao and Jacob,
>>   I think some of the timing code in the Kokkos interface is wrong. 
>> 
>> *  The PetscLogGpuTimeBegin/End should be inside the viewer access code 
>> not outside it. (The GPU time is an attempt to best time the kernels, not 
>> other processing around the use of the kernels, that other stuff is captured 
>> in the general LogEventBegin/End.
>> What about potential host to device memory copy before calling a kernel?  
>> Should we count it in the kernel time?
> 
>   Nope, absolutely not. The GPU time represents the time the GPU is doing 
> active work. The outer LogEventBegin/End captures the entire time, including 
> copies, kernel launches etc. No reason to put the copy time in the GPU time 
> because then there would be no need for the GPU since it would be the 
> LogEventBegin/End. The LogEventBegin/End minus the GPU time represents any 
> overhead from transfers.
> 
> 
>> 
>> Good point 
>> *  The use of WaitForKokkos() is confusing and seems inconsistent. 
>> I need to have a look. Until now, I have not paid much attention to kokkos 
>> profiling.
>>  -For example it is used in VecTDot_SeqKokkos() which I would 
>> think has a barrier anyways because it puts a scalar result into update? 
>>  -Plus PetscLogGpuTimeBegin/End is supposed to already have a 
>> suitable system (that Hong added) to ensure the kernel is complete; reading 
>> the manual page and looking at Jacob's cupmcontext.hpp it seems to be there 
>> so I don't think WaitForKokkos() is needed in most places (or is Kokkos 
>> asynchronous and needs this for correctness?) 
>> But these won't explain the strange result of overall flop rate being higher 
>> than GPU flop rate.
>> 
>>   Barry
>> 
>> 
>> 
>> 
>> 
>>> On Jan 22, 2022, at 11:44 AM, Mark Adams >> > wrote:
>>> 
>>> I am getting some funny timings and I'm trying to figure it out.  
>>> I figure the GPU flop rates are a bit higher because the timers are inside of 
>>> the CPU timers, but some are a lot bigger or inverted 
>>> 
>>> --- Event Stage 2: KSP Solve only
>>> 
>>> MatMult  400 1.0 1.0094e+01 1.2 1.07e+11 1.0 3.7e+05 6.1e+04 
>>> 0.0e+00  2 55 62 54  0  68 91100100  0 671849   857147  0 0.00e+000 
>>> 0.00e+00 100
>>> MatView2 1.0 4.5257e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 2.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
>>> 0.00e+00  0
>>> KSPSolve   2 1.0 1.4591e+01 1.1 1.18e+11 1.0 3.7e+05 6.1e+04 
>>> 1.2e+03  2 60 62 54 60 100100100100100 512399   804048  0 0.00e+000 
>>> 0.00e+00 100
>>> SFPack   400 1.0 2.4545e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
>>> 0.00e+00  0
>>> SFUnpack 400 1.0 9.4637e-05 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
>>> 0.00e+00  0
>>> VecTDot  802 1.0 3.0577e+00 2.1 3.36e+09 1.0 0.0e+00 0.0e+00 
>>> 8.0e+02  0  2  0  0 40  13  3  0  0 67 69996   488328  0 0.00e+000 
>>> 0.00e+00 100
>>> VecNorm  402 1.0 1.9597e+00 3.4 1.69e+09 1.0 0.0e+00 0.0e+00 
>>> 4.0e+02  0  1  0  0 20   6  1  0  0 33 54744   571507  0 0.00e+000 
>>> 0.00e+00 100
>>> VecCopy4 1.0 1.7143e-0228.6 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Barry Smith


> On Jan 23, 2022, at 10:01 PM, Junchao Zhang  wrote:
> 
> 
> 
> On Sat, Jan 22, 2022 at 9:00 PM Junchao Zhang  > wrote:
> 
> 
> 
> On Sat, Jan 22, 2022 at 5:00 PM Barry Smith  > wrote:
> 
>   The GPU flop rate (when 100 percent flops on the GPU) should always be 
> higher than the overall flop rate (the previous column). For large problems 
> they should be similar, for small problems the GPU one may be much higher.
> 
>   If the CPU one is higher (when 100 percent flops on the GPU) something must 
> be wrong with the logging. I looked at the code for the two cases and didn't 
> see anything obvious.
> 
>   Junchao and Jacob,
>   I think some of the timing code in the Kokkos interface is wrong. 
> 
> *  The PetscLogGpuTimeBegin/End should be inside the viewer access code 
> not outside it. (The GPU time is an attempt to best time the kernels, not 
> other processing around the use of the kernels, that other stuff is captured 
> in the general LogEventBegin/End.
> What about potential host to device memory copy before calling a kernel?  
> Should we count it in the kernel time?

  Nope, absolutely not. The GPU time represents the time the GPU is doing 
active work. The outer LogEventBegin/End captures the entire time, including 
copies, kernel launches etc. No reason to put the copy time in the GPU time 
because then there would be no need for the GPU since it would be the 
LogEventBegin/End. The LogEventBegin/End minus the GPU time represents any 
overhead from transfers.
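
  As an illustration of that nesting (a sketch only, with a user-registered event and 
VecTDot standing in for the device kernel; this is not the actual VecTDot_SeqKokkos code):

```c
#include <petscvec.h>

static PetscErrorCode MyGpuDot(Vec x, Vec y, PetscScalar *result)
{
  static PetscLogEvent MYDOT = 0;
  PetscErrorCode       ierr;

  if (!MYDOT) {ierr = PetscLogEventRegister("MyGpuDot", VEC_CLASSID, &MYDOT);CHKERRQ(ierr);}
  ierr = PetscLogEventBegin(MYDOT, x, y, 0, 0);CHKERRQ(ierr);
  /* ... host->device copies and launch bookkeeping would happen here ... */
  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);  /* only the device work is timed */
  ierr = VecTDot(x, y, result);CHKERRQ(ierr);   /* stand-in for the kernel */
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  /* ... device->host copy of the scalar result would happen here ... */
  ierr = PetscLogEventEnd(MYDOT, x, y, 0, 0);CHKERRQ(ierr);
  return 0;
}
```

The event column in -log_view then includes the copies and launch latency while the GPU 
time includes only the kernel, and their difference is the overhead described above.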


> 
> Good point 
> *  The use of WaitForKokkos() is confusing and seems inconsistent. 
> I need to have a look. Until now, I have not paid much attention to kokkos 
> profiling.
>  -For example it is used in VecTDot_SeqKokkos() which I would 
> think has a barrier anyways because it puts a scalar result into update? 
>  -Plus PetscLogGpuTimeBegin/End is supposed to already have a 
> suitable system (that Hong added) to ensure the kernel is complete; reading 
> the manual page and looking at Jacob's cupmcontext.hpp it seems to be there so 
> I don't think WaitForKokkos() is needed in most places (or is Kokkos 
> asynchronous and needs this for correctness?) 
> But these won't explain the strange result of overall flop rate being higher 
> than GPU flop rate.
> 
>   Barry
> 
> 
> 
> 
> 
>> On Jan 22, 2022, at 11:44 AM, Mark Adams > > wrote:
>> 
>> I am getting some funny timings and I'm trying to figure it out.  
>> I figure the GPU flop rates are a bit higher because the timers are inside of 
>> the CPU timers, but some are a lot bigger or inverted 
>> 
>> --- Event Stage 2: KSP Solve only
>> 
>> MatMult  400 1.0 1.0094e+01 1.2 1.07e+11 1.0 3.7e+05 6.1e+04 
>> 0.0e+00  2 55 62 54  0  68 91100100  0 671849   857147  0 0.00e+000 
>> 0.00e+00 100
>> MatView2 1.0 4.5257e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 2.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> KSPSolve   2 1.0 1.4591e+01 1.1 1.18e+11 1.0 3.7e+05 6.1e+04 
>> 1.2e+03  2 60 62 54 60 100100100100100 512399   804048  0 0.00e+000 
>> 0.00e+00 100
>> SFPack   400 1.0 2.4545e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> SFUnpack 400 1.0 9.4637e-05 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> VecTDot  802 1.0 3.0577e+00 2.1 3.36e+09 1.0 0.0e+00 0.0e+00 
>> 8.0e+02  0  2  0  0 40  13  3  0  0 67 69996   488328  0 0.00e+000 
>> 0.00e+00 100
>> VecNorm  402 1.0 1.9597e+00 3.4 1.69e+09 1.0 0.0e+00 0.0e+00 
>> 4.0e+02  0  1  0  0 20   6  1  0  0 33 54744   571507  0 0.00e+000 
>> 0.00e+00 100
>> VecCopy4 1.0 1.7143e-0228.6 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> VecSet 4 1.0 3.8051e-0316.9 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> VecAXPY  800 1.0 8.6160e-0113.6 3.36e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  2  0  0  0   6  3  0  0  0 247787   448304  0 0.00e+000 
>> 0.00e+00 100
>> VecAYPX  398 1.0 1.6831e+0031.1 1.67e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  1  0  0  0   5  1  0  0  0 63107   77030  0 0.00e+000 
>> 0.00e+00 100
>> VecPointwiseMult 402 1.0 3.8729e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   2  1  0  0  0 138502   262413  0 0.00e+000 
>> 0.00e+00 100
>> VecScatterBegin  400 1.0 1.1947e+0035.1 0.00e+00 0.0 3.7e+05 6.1e+04 
>> 0.0e+00  0  0 62 54  0   5  0100100  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> VecScatterEnd400 1.0 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Junchao Zhang
On Sat, Jan 22, 2022 at 9:00 PM Junchao Zhang 
wrote:

>
>
>
> On Sat, Jan 22, 2022 at 5:00 PM Barry Smith  wrote:
>
>>
>>   The GPU flop rate (when 100 percent flops on the GPU) should always be
>> higher than the overall flop rate (the previous column). For large problems
>> they should be similar, for small problems the GPU one may be much higher.
>>
>>   If the CPU one is higher (when 100 percent flops on the GPU) something
>> must be wrong with the logging. I looked at the code for the two cases and
>> didn't see anything obvious.
>>
>>   Junchao and Jacob,
>>   I think some of the timing code in the Kokkos interface is wrong.
>>
>> *  The PetscLogGpuTimeBegin/End should be inside the viewer access
>> code not outside it. (The GPU time is an attempt to best time the kernels,
>> not other processing around the use of the kernels, that other stuff is
>> captured in the general LogEventBegin/End.
>>
> What about potential host to device memory copy before calling a kernel?
Should we count it in the kernel time?

Good point
>
>> *  The use of WaitForKokkos() is confusing and seems inconsistent.
>>
> I need to have a look. Until now, I have not paid much attention to kokkos
> profiling.
>
>>  -For example it is used in VecTDot_SeqKokkos() which I would
>> think has a barrier anyways because it puts a scalar result into update?
>>  -Plus PetscLogGpuTimeBegin/End is supposed to already have a
>> suitable system (that Hong added) to ensure the kernel is complete; reading
>> the manual page and looking at Jacob's cupmcontext.hpp it seems to be there
>> so I don't think WaitForKokkos() is needed in most places (or is Kokkos
>> asynchronous and needs this for correctness?)
>> But these won't explain the strange result of overall flop rate being
>> higher than GPU flop rate.
>>
>>   Barry
>>
>>
>>
>>
>>
>> On Jan 22, 2022, at 11:44 AM, Mark Adams  wrote:
>>
>> I am getting some funny timings and I'm trying to figure it out.
>> I figure the GPU flop rates are a bit higher because the timers are inside
>> of the CPU timers, but *some are a lot bigger or inverted*
>>
>> --- Event Stage 2: KSP Solve only
>>
>> MatMult  400 1.0 1.0094e+01 1.2 1.07e+11 1.0 3.7e+05 6.1e+04
>> 0.0e+00  2 55 62 54  0  68 91100100  0 671849   857147  0 0.00e+000
>> 0.00e+00 100
>> MatView2 1.0 4.5257e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00
>> 2.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> KSPSolve   2 1.0 1.4591e+01 1.1 1.18e+11 1.0 3.7e+05 6.1e+04
>> 1.2e+03  2 60 62 54 60 100100100100100 512399   804048  0 0.00e+000
>> 0.00e+00 100
>> SFPack   400 1.0 2.4545e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> SFUnpack 400 1.0 9.4637e-05 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> VecTDot  802 1.0 3.0577e+00 2.1 3.36e+09 1.0 0.0e+00 0.0e+00
>> 8.0e+02  0  2  0  0 40  13  3  0  0 67 *69996   488328*  0 0.00e+00
>>0 0.00e+00 100
>> VecNorm  402 1.0 1.9597e+00 3.4 1.69e+09 1.0 0.0e+00 0.0e+00
>> 4.0e+02  0  1  0  0 20   6  1  0  0 33 54744   571507  0 0.00e+000
>> 0.00e+00 100
>> VecCopy4 1.0 1.7143e-0228.6 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> VecSet 4 1.0 3.8051e-0316.9 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> VecAXPY  800 1.0 8.6160e-0113.6 3.36e+09 1.0 0.0e+00 0.0e+00
>> 0.0e+00  0  2  0  0  0   6  3  0  0  0 *247787   448304*  0 0.00e+00
>>0 0.00e+00 100
>> VecAYPX  398 1.0 1.6831e+0031.1 1.67e+09 1.0 0.0e+00 0.0e+00
>> 0.0e+00  0  1  0  0  0   5  1  0  0  0 63107   77030  0 0.00e+000
>> 0.00e+00 100
>> VecPointwiseMult 402 1.0 3.8729e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   2  1  0  0  0 138502   262413  0 0.00e+000
>> 0.00e+00 100
>> VecScatterBegin  400 1.0 1.1947e+0035.1 0.00e+00 0.0 3.7e+05 6.1e+04
>> 0.0e+00  0  0 62 54  0   5  0100100  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterEnd400 1.0 6.2969e+00 8.8 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0  10  0  0  0  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> PCApply  402 1.0 3.8758e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   2  1  0  0  0 138396   262413  0 0.00e+000
>> 0.00e+00 100
>>
>> ---
>>
>>
>> On Sat, Jan 22, 2022 at 11:10 AM Junchao Zhang 
>> wrote:
>>
>>>
>>>
>>>
>>> On Sat, Jan 22, 2022 at 10:04 AM Mark 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Barry Smith

 You have debugging turned on on Crusher but not Perlmutter

> On Jan 23, 2022, at 6:37 PM, Mark Adams  wrote:
> 
> * Perlmutter is roughly 5x faster than Crusher on the one node 2M eq test. 
> (small)
> This is with 8 processes. 
> 
> * The next largest version of this test, 16M eq total and 8 processes, fails 
> in memory allocation in the mat-mult setup in the Kokkos Mat.
> 
> * If I try to run with 64 processes on Perlmutter I get this error in 
> initialization. These nodes have 160 GB of memory.
> (I assume this is related to these large memory requirements from loading 
> packages, etc)
> 
> Thanks,
> Mark
> 
> + srun -n64 -N1 --cpu-bind=cores --ntasks-per-core=1 ../ex13 
> -dm_plex_box_faces 4,4,4 -petscpartitioner_simple_process_grid 4,4,4 
> -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1 -dm_refine 
> 6 -dm_view -pc_type jacobi -log
> _view -ksp_view -use_gpu_aware_mpi false -dm_mat_type aijkokkos -dm_vec_type 
> kokkos -log_trace
> + tee jac_out_001_kokkos_Perlmutter_6_8.txt
> [48]PETSC ERROR: - Error Message 
> --
> [48]PETSC ERROR: GPU error 
> [48]PETSC ERROR: cuda error 2 (cudaErrorMemoryAllocation) : out of memory
> [48]PETSC ERROR: See https://petsc.org/release/faq/ 
>  for trouble shooting.
> [48]PETSC ERROR: Petsc Development GIT revision: v3.16.3-683-gbc458ed4d8  GIT 
> Date: 2022-01-22 12:18:02 -0600
> [48]PETSC ERROR: /global/u2/m/madams/petsc/src/snes/tests/data/../ex13 on a 
> arch-perlmutter-opt-gcc-kokkos-cuda named nid001424 by madams Sun Jan 23 
> 15:19:56 2022
> [48]PETSC ERROR: Configure options --CFLAGS="   -g -DLANDAU_DIM=2 
> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CXXFLAGS=" -g -DLANDAU_DIM=2 
> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CUDAFLAGS="-g -Xcompiler 
> -rdynamic -DLANDAU_DIM=2 -DLAN
> DAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --with-cc=cc --with-cxx=CC --with-fc=ftn 
> --LDFLAGS=-lmpifort_gnu_91 
> --with-cudac=/global/common/software/nersc/cos1.3/cuda/11.3.0/bin/nvcc 
> --COPTFLAGS="   -O3" --CXXOPTFLAGS=" -O3" --FOPTFLAGS="   -O3"
>  --with-debugging=0 --download-metis --download-parmetis --with-cuda=1 
> --with-cuda-arch=80 --with-mpiexec=srun --with-batch=0 --download-p4est=1 
> --with-zlib=1 --download-kokkos --download-kokkos-kernels 
> --with-kokkos-kernels-tpl=0 --with-
> make-np=8 PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda
> [48]PETSC ERROR: #1 initialize() at 
> /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:72
> [48]PETSC ERROR: #2 initialize() at 
> /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:343
> [48]PETSC ERROR: #3 PetscDeviceInitializeTypeFromOptions_Private() at 
> /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:319
> [48]PETSC ERROR: #4 PetscDeviceInitializeFromOptions_Internal() at 
> /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:449
> [48]PETSC ERROR: #5 PetscInitialize_Common() at 
> /global/u2/m/madams/petsc/src/sys/objects/pinit.c:963
> [48]PETSC ERROR: #6 PetscInitialize() at 
> /global/u2/m/madams/petsc/src/sys/objects/pinit.c:1238
> 
> 
> On Sun, Jan 23, 2022 at 8:58 AM Mark Adams  > wrote:
> 
> 
> On Sat, Jan 22, 2022 at 6:22 PM Barry Smith  > wrote:
> 
>I cleaned up Mark's last run and put it in a fixed-width font. I realize 
> this may be too difficult but it would be great to have identical runs to 
> compare with on Summit.
> 
> I was planning on running this on Perlmutter today, as well as some sanity 
> checks like all GPUs are being used. I'll try PetscDeviceView.
> 
> Junchao modified the timers and all GPU > CPU now, but he seemed to move the 
> timers more outside and Barry wants them tight on the "kernel".
> I think Junchao is going to work on that so I will hold off.
> (I removed the Kokkos wait stuff and it seemed to run a little faster but I 
> am not sure how deterministic the timers are, and I did a test with GAMG and 
> it was fine.)
> 
> 
> 
>As Jed noted Scatter takes a long time but the pack and unpack take no 
> time? Is this not timed if using Kokkos?
> 
> 
> --- Event Stage 2: KSP Solve only
> 
> MatMult  400 1.0 8.8003e+00 1.1 1.06e+11 1.0 2.2e+04 8.5e+04 
> 0.0e+00  2 55 61 54  0  70 91100100   95,058   132,242  0 0.00e+000 
> 0.00e+00 100
> VecScatterBegin  400 1.0 1.3391e+00 2.6 0.00e+00 0.0 2.2e+04 8.5e+04 
> 0.0e+00  0  0 61 54  0   7  01001000 0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd400 1.0 1.3240e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   9  0  0  00 0  0 0.00e+000 
> 0.00e+00  0
> SFPack   400 1.0 1.8276e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  00 0  0 0.00e+000 
> 0.00e+00  0
> SFUnpack 400 1.0 6.2653e-05 1.6 0.00e+00 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Mark Adams
* Perlmutter is roughly 5x faster than Crusher on the one node 2M eq test.
(small)
This is with 8 processes.

* The next largest version of this test, 16M eq total and 8 processes,
fails in memory allocation in the mat-mult setup in the Kokkos Mat.

* If I try to run with 64 processes on Perlmutter I get this error in
initialization. These nodes have 160 GB of memory.
(I assume this is related to these large memory requirements from loading
packages, etc)

Thanks,
Mark

+ srun -n64 -N1 --cpu-bind=cores --ntasks-per-core=1 ../ex13
-dm_plex_box_faces 4,4,4 -petscpartitioner_simple_process_grid 4,4,4
-dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1
-dm_refine 6 -dm_view -pc_type jacobi -log
_view -ksp_view -use_gpu_aware_mpi false -dm_mat_type aijkokkos
-dm_vec_type kokkos -log_trace
+ tee jac_out_001_kokkos_Perlmutter_6_8.txt
[48]PETSC ERROR: - Error Message
--
[48]PETSC ERROR: GPU error
[48]PETSC ERROR: cuda error 2 (cudaErrorMemoryAllocation) : out of memory
[48]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[48]PETSC ERROR: Petsc Development GIT revision: v3.16.3-683-gbc458ed4d8
 GIT Date: 2022-01-22 12:18:02 -0600
[48]PETSC ERROR: /global/u2/m/madams/petsc/src/snes/tests/data/../ex13 on a
arch-perlmutter-opt-gcc-kokkos-cuda named nid001424 by madams Sun Jan 23
15:19:56 2022
[48]PETSC ERROR: Configure options --CFLAGS="   -g -DLANDAU_DIM=2
-DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CXXFLAGS=" -g -DLANDAU_DIM=2
-DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CUDAFLAGS="-g -Xcompiler
-rdynamic -DLANDAU_DIM=2 -DLAN
DAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --with-cc=cc --with-cxx=CC
--with-fc=ftn --LDFLAGS=-lmpifort_gnu_91
--with-cudac=/global/common/software/nersc/cos1.3/cuda/11.3.0/bin/nvcc
--COPTFLAGS="   -O3" --CXXOPTFLAGS=" -O3" --FOPTFLAGS="   -O3"
 --with-debugging=0 --download-metis --download-parmetis --with-cuda=1
--with-cuda-arch=80 --with-mpiexec=srun --with-batch=0 --download-p4est=1
--with-zlib=1 --download-kokkos --download-kokkos-kernels
--with-kokkos-kernels-tpl=0 --with-
make-np=8 PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda
[48]PETSC ERROR: #1 initialize() at
/global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:72
[48]PETSC ERROR: #2 initialize() at
/global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:343
[48]PETSC ERROR: #3 PetscDeviceInitializeTypeFromOptions_Private() at
/global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:319
[48]PETSC ERROR: #4 PetscDeviceInitializeFromOptions_Internal() at
/global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:449
[48]PETSC ERROR: #5 PetscInitialize_Common() at
/global/u2/m/madams/petsc/src/sys/objects/pinit.c:963
[48]PETSC ERROR: #6 PetscInitialize() at
/global/u2/m/madams/petsc/src/sys/objects/pinit.c:1238


On Sun, Jan 23, 2022 at 8:58 AM Mark Adams  wrote:

>
>
> On Sat, Jan 22, 2022 at 6:22 PM Barry Smith  wrote:
>
>>
>>I cleaned up Mark's last run and put it in a fixed-width font. I
>> realize this may be too difficult but it would be great to have identical
>> runs to compare with on Summit.
>>
>
> I was planning on running this on Perlmutter today, as well as some sanity
> checks like all GPUs are being used. I'll try PetscDeviceView.
>
> Junchao modified the timers and all GPU > CPU now, but he seemed to move
> the timers more outside and Barry wants them tight on the "kernel".
> I think Junchao is going to work on that so I will hold off.
> (I removed the Kokkos wait stuff and it seemed to run a little faster but
> I am not sure how deterministic the timers are, and I did a test with GAMG
> and it was fine.)
>
>
>>
>>As Jed noted Scatter takes a long time but the pack and unpack take no
>> time? Is this not timed if using Kokkos?
>>
>>
>> --- Event Stage 2: KSP Solve only
>>
>> MatMult  400 1.0 8.8003e+00 1.1 1.06e+11 1.0 2.2e+04 8.5e+04
>> 0.0e+00  2 55 61 54  0  70 91100100   95,058   132,242  0 0.00e+000
>> 0.00e+00 100
>> VecScatterBegin  400 1.0 1.3391e+00 2.6 0.00e+00 0.0 2.2e+04 8.5e+04
>> 0.0e+00  0  0 61 54  0   7  01001000 0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterEnd400 1.0 1.3240e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   9  0  0  00 0  0 0.00e+000
>> 0.00e+00  0
>> SFPack   400 1.0 1.8276e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  00 0  0 0.00e+000
>> 0.00e+00  0
>> SFUnpack 400 1.0 6.2653e-05 1.6 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  00 0  0 0.00e+000
>> 0.00e+00  0
>>
>> KSPSolve   2 1.0 1.2540e+01 1.0 1.17e+11 1.0 2.2e+04 8.5e+04
>> 1.2e+03  3 60 61 54 60 100100100  73,592   116,796  0 0.00e+000
>> 0.00e+00 100
>> VecTDot  802 1.0 1.3551e+00 1.2 3.36e+09 1.0 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Mark Adams
On Sat, Jan 22, 2022 at 6:22 PM Barry Smith  wrote:

>
>I cleaned up Mark's last run and put it in a fixed-width font. I
> realize this may be too difficult but it would be great to have identical
> runs to compare with on Summit.
>

I was planning on running this on Perlmutter today, as well as some sanity
checks like all GPUs are being used. I'll try PetscDeviceView.

Junchao modified the timers and all GPU > CPU now, but he seemed to move
the timers more outside and Barry wants them tight on the "kernel".
I think Junchao is going to work on that so I will hold off.
(I removed the Kokkos wait stuff and it seemed to run a little faster but
I am not sure how deterministic the timers are, and I did a test with GAMG
and it was fine.)


>
>As Jed noted Scatter takes a long time but the pack and unpack take no
> time? Is this not timed if using Kokkos?
>
>
> --- Event Stage 2: KSP Solve only
>
> MatMult  400 1.0 8.8003e+00 1.1 1.06e+11 1.0 2.2e+04 8.5e+04
> 0.0e+00  2 55 61 54  0  70 91100100   95,058   132,242  0 0.00e+000
> 0.00e+00 100
> VecScatterBegin  400 1.0 1.3391e+00 2.6 0.00e+00 0.0 2.2e+04 8.5e+04
> 0.0e+00  0  0 61 54  0   7  01001000 0  0 0.00e+000
> 0.00e+00  0
> VecScatterEnd400 1.0 1.3240e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   9  0  0  00 0  0 0.00e+000
> 0.00e+00  0
> SFPack   400 1.0 1.8276e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  00 0  0 0.00e+000
> 0.00e+00  0
> SFUnpack 400 1.0 6.2653e-05 1.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  00 0  0 0.00e+000
> 0.00e+00  0
>
> KSPSolve   2 1.0 1.2540e+01 1.0 1.17e+11 1.0 2.2e+04 8.5e+04
> 1.2e+03  3 60 61 54 60 100100100  73,592   116,796  0 0.00e+000
> 0.00e+00 100
> VecTDot  802 1.0 1.3551e+00 1.2 3.36e+09 1.0 0.0e+00 0.0e+00
> 8.0e+02  0  2  0  0 40  10  3  0  19,62752,599  0 0.00e+000
> 0.00e+00 100
> VecNorm  402 1.0 9.0151e-01 2.2 1.69e+09 1.0 0.0e+00 0.0e+00
> 4.0e+02  0  1  0  0 20   5  1  0  0   14,788   125,477  0 0.00e+000
> 0.00e+00 100
> VecAXPY  800 1.0 8.2617e-01 1.0 3.36e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  2  0  0  0   7  3  0  0   32,11261,644  0 0.00e+000
> 0.00e+00 100
> VecAYPX  398 1.0 8.1525e-01 1.6 1.67e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  1  0  0  0   5  1  0  0   16,19020,689  0 0.00e+000
> 0.00e+00 100
> VecPointwiseMult 402 1.0 3.5694e-01 1.0 8.43e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   3  1  0  0   18,67538,633  0 0.00e+000
> 0.00e+00 100
>
>
>
> On Jan 22, 2022, at 12:40 PM, Mark Adams  wrote:
>
> And I have a new MR with if you want to see what I've done so far.
>
>
>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith


> On Jan 22, 2022, at 10:00 PM, Junchao Zhang  wrote:
> 
> 
> 
> 
> On Sat, Jan 22, 2022 at 5:00 PM Barry Smith  > wrote:
> 
>   The GPU flop rate (when 100 percent flops on the GPU) should always be 
> higher than the overall flop rate (the previous column). For large problems 
> they should be similar, for small problems the GPU one may be much higher.
> 
>   If the CPU one is higher (when 100 percent flops on the GPU) something must 
> be wrong with the logging. I looked at the code for the two cases and didn't 
> see anything obvious.
> 
>   Junchao and Jacob,
>   I think some of the timing code in the Kokkos interface is wrong. 
> 
> *  The PetscLogGpuTimeBegin/End should be inside the viewer access code 
> not outside it. (The GPU time is an attempt to best time the kernels, not 
> other processing around the use of the kernels, that other stuff is captured 
> in the general LogEventBegin/End.
> Good point 
> *  The use of WaitForKokkos() is confusing and seems inconsistent. 
> I need to have a look. Until now, I have not paid much attention to kokkos 
> profiling.

  That is what is so great about Mark. He makes us do what we should have done 
before :-)


>  -For example it is used in VecTDot_SeqKokkos() which I would 
> think has a barrier anyways because it puts a scalar result into update? 
>  -Plus PetscLogGpuTimeBegin/End is supposed to already have a 
> suitable system (that Hong added) to ensure the kernel is complete; reading 
> the manual page and looking at Jacob's cupmcontext.hpp it seems to be there so 
> I don't think WaitForKokkos() is needed in most places (or is Kokkos 
> asynchronous and needs this for correctness?) 
> But these won't explain the strange result of overall flop rate being higher 
> than GPU flop rate.
> 
>   Barry
> 
> 
> 
> 
> 
>> On Jan 22, 2022, at 11:44 AM, Mark Adams > > wrote:
>> 
>> I am getting some funny timings and I'm trying to figure it out.  
>> I figure the GPU flop rates are a bit higher because the timers are inside of 
>> the CPU timers, but some are a lot bigger or inverted 
>> 
>> --- Event Stage 2: KSP Solve only
>> 
>> MatMult  400 1.0 1.0094e+01 1.2 1.07e+11 1.0 3.7e+05 6.1e+04 
>> 0.0e+00  2 55 62 54  0  68 91100100  0 671849   857147  0 0.00e+000 
>> 0.00e+00 100
>> MatView2 1.0 4.5257e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 2.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> KSPSolve   2 1.0 1.4591e+01 1.1 1.18e+11 1.0 3.7e+05 6.1e+04 
>> 1.2e+03  2 60 62 54 60 100100100100100 512399   804048  0 0.00e+000 
>> 0.00e+00 100
>> SFPack   400 1.0 2.4545e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> SFUnpack 400 1.0 9.4637e-05 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> VecTDot  802 1.0 3.0577e+00 2.1 3.36e+09 1.0 0.0e+00 0.0e+00 
>> 8.0e+02  0  2  0  0 40  13  3  0  0 67 69996   488328  0 0.00e+000 
>> 0.00e+00 100
>> VecNorm  402 1.0 1.9597e+00 3.4 1.69e+09 1.0 0.0e+00 0.0e+00 
>> 4.0e+02  0  1  0  0 20   6  1  0  0 33 54744   571507  0 0.00e+000 
>> 0.00e+00 100
>> VecCopy4 1.0 1.7143e-0228.6 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> VecSet 4 1.0 3.8051e-0316.9 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> VecAXPY  800 1.0 8.6160e-0113.6 3.36e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  2  0  0  0   6  3  0  0  0 247787   448304  0 0.00e+000 
>> 0.00e+00 100
>> VecAYPX  398 1.0 1.6831e+0031.1 1.67e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  1  0  0  0   5  1  0  0  0 63107   77030  0 0.00e+000 
>> 0.00e+00 100
>> VecPointwiseMult 402 1.0 3.8729e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   2  1  0  0  0 138502   262413  0 0.00e+000 
>> 0.00e+00 100
>> VecScatterBegin  400 1.0 1.1947e+0035.1 0.00e+00 0.0 3.7e+05 6.1e+04 
>> 0.0e+00  0  0 62 54  0   5  0100100  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> VecScatterEnd400 1.0 6.2969e+00 8.8 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0  10  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> PCApply  402 1.0 3.8758e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   2  1  0  0  0 138396   262413  0 0.00e+000 
>> 0.00e+00 100
>> ---
>> 
>> 
>> On Sat, Jan 22, 2022 at 11:10 AM Junchao Zhang > 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Junchao Zhang
On Sat, Jan 22, 2022 at 5:00 PM Barry Smith  wrote:

>
>   The GPU flop rate (when 100 percent flops on the GPU) should always be
> higher than the overall flop rate (the previous column). For large problems
> they should be similar, for small problems the GPU one may be much higher.
>
>   If the CPU one is higher (when 100 percent flops on the GPU) something
> must be wrong with the logging. I looked at the code for the two cases and
> didn't see anything obvious.
>
>   Junchao and Jacob,
>   I think some of the timing code in the Kokkos interface is wrong.
>
> *  The PetscLogGpuTimeBegin/End should be inside the viewer access
> code not outside it. (The GPU time is an attempt to best time the kernels,
> not other processing around the use of the kernels, that other stuff is
> captured in the general LogEventBegin/End.
>
Good point

> *  The use of WaitForKokkos() is confusing and seems inconsistent.
>
I need to have a look. Until now, I have not paid much attention to kokkos
profiling.

>  -For example it is used in VecTDot_SeqKokkos() which I would
> think has a barrier anyways because it puts a scalar result into update?
>  -Plus PetscLogGpuTimeBegin/End is supposed to already have a
> suitable system (that Hong added) to ensure the kernel is complete; reading
> the manual page and looking at Jacob's cupmcontext.hpp it seems to be there
> so I don't think WaitForKokkos() is needed in most places (or is Kokkos
> asynchronous and needs this for correctness?)
> But these won't explain the strange result of overall flop rate being
> higher than GPU flop rate.
>
>   Barry
>
>
>
>
>
> On Jan 22, 2022, at 11:44 AM, Mark Adams  wrote:
>
> I am getting some funny timings and I'm trying to figure it out.
> I figure the GPU flop rates are a bit higher because the timers are inside
> of the CPU timers, but *some are a lot bigger or inverted*
>
> --- Event Stage 2: KSP Solve only
>
> MatMult  400 1.0 1.0094e+01 1.2 1.07e+11 1.0 3.7e+05 6.1e+04
> 0.0e+00  2 55 62 54  0  68 91100100  0 671849   857147  0 0.00e+000
> 0.00e+00 100
> MatView2 1.0 4.5257e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> KSPSolve   2 1.0 1.4591e+01 1.1 1.18e+11 1.0 3.7e+05 6.1e+04
> 1.2e+03  2 60 62 54 60 100100100100100 512399   804048  0 0.00e+000
> 0.00e+00 100
> SFPack   400 1.0 2.4545e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> SFUnpack 400 1.0 9.4637e-05 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> VecTDot  802 1.0 3.0577e+00 2.1 3.36e+09 1.0 0.0e+00 0.0e+00
> 8.0e+02  0  2  0  0 40  13  3  0  0 67 *69996   488328*  0 0.00e+00
>  0 0.00e+00 100
> VecNorm  402 1.0 1.9597e+00 3.4 1.69e+09 1.0 0.0e+00 0.0e+00
> 4.0e+02  0  1  0  0 20   6  1  0  0 33 54744   571507  0 0.00e+000
> 0.00e+00 100
> VecCopy4 1.0 1.7143e-0228.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> VecSet 4 1.0 3.8051e-0316.9 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> VecAXPY  800 1.0 8.6160e-0113.6 3.36e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  2  0  0  0   6  3  0  0  0 *247787   448304*  0 0.00e+00
>0 0.00e+00 100
> VecAYPX  398 1.0 1.6831e+0031.1 1.67e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  1  0  0  0   5  1  0  0  0 63107   77030  0 0.00e+000
> 0.00e+00 100
> VecPointwiseMult 402 1.0 3.8729e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   2  1  0  0  0 138502   262413  0 0.00e+000
> 0.00e+00 100
> VecScatterBegin  400 1.0 1.1947e+0035.1 0.00e+00 0.0 3.7e+05 6.1e+04
> 0.0e+00  0  0 62 54  0   5  0100100  0 0   0  0 0.00e+000
> 0.00e+00  0
> VecScatterEnd400 1.0 6.2969e+00 8.8 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0  10  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> PCApply  402 1.0 3.8758e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   2  1  0  0  0 138396   262413  0 0.00e+000
> 0.00e+00 100
>
> ---
>
>
> On Sat, Jan 22, 2022 at 11:10 AM Junchao Zhang 
> wrote:
>
>>
>>
>>
>> On Sat, Jan 22, 2022 at 10:04 AM Mark Adams  wrote:
>>
>>> Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End()
>>> right?
>>>
>> No, PetscLogGpuTime() does not know the flops of the caller.
>>
>>
>>>
>>> On Fri, Jan 21, 2022 at 9:47 PM Barry Smith  wrote:
>>>

   Mark,

   Fix the logging 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith


  I am not arguing for a rickety set of scripts, I am arguing that doing more 
is not so easy and it is only worth doing if the underlying benchmark is worth 
the effort. 

> On Jan 22, 2022, at 8:08 PM, Jed Brown  wrote:
> 
> Yeah, I'm referring to the operational aspect of data management, not 
> benchmark design (which is hard and even Sam had years working with Mark and 
> me on HPGMG to refine that).
> 
> If you run libCEED BPs (which use PETSc), you can run one command
> 
> srun -N ./bps -ceed /cpu/self/xsmm/blocked,/gpu/cuda/gen -degree 2,3,4,5 
> -local_nodes 1000,500 -problem bp1,bp2,bp3,bp4
> 
> and it'll loop (in C code) over all the combinations (reusing some 
> non-benchmarked things like the DMPlex) across the whole range of sizes, 
> problems, devices. It makes one output file and you feed that to a Python 
> script to read it as a Pandas DataFrame and plot (or read and interact in a 
> notebook). You can have a basket of files from different machines and slice 
> those plots without code changes.
> 
> We should do similar for a suite of PETSc benchmarks, even just basic Vec and 
> Mat operations like in the reports. It isn't more work than a rickety bundle 
> of scripts, and it's a lot less error-prone.
> 
> Barry Smith  writes:
> 
>>  I submit it is actually a good amount of additional work and requires real 
>> creativity and very good judgment; it is not a good intro or undergrad 
>> project, especially for someone without a huge amount of hands-on experience
>> already. Look who had to do the new SpecHPC multigrid benchmark. The last
>> time I checked Sam was not an undergrad: Senior Scientist, Lawrence Berkeley
>> National Laboratory, cited by 11194. I definitely do not plan to involve
>> myself in any brand new serious benchmarking studies in my current lifetime, 
>> doing one correctly is a massive undertaking IMHO.
>> 
>>> On Jan 22, 2022, at 6:43 PM, Jed Brown  wrote:
>>> 
>>> This isn't so much more or less work, but work in more useful places. Maybe 
>>> this is a good undergrad or intro project to make a clean workflow for 
>>> these experiments.
>>> 
>>> Barry Smith  writes:
>>> 
 Performance studies are enormously difficult to do well, which is why 
 there are so few good ones out there. And unless you fall into the LINPACK 
 benchmark or hit upon Streams the rewards of doing an excellent job are 
 pretty thin. Even Streams was not properly maintained for many years, you 
 could not just get it and use it out of the box for a variety of purposes 
 (which is why PETSc has its hacked-up ones). I submit a proper 
 performance study is a full-time job and everyone always has those.
 
> On Jan 22, 2022, at 2:11 PM, Jed Brown  wrote:
> 
> Barry Smith  writes:
> 
>>> On Jan 22, 2022, at 12:15 PM, Jed Brown  wrote:
>>> Barry, when you did the tech reports, did you make an example to 
>>> reproduce on other architectures? Like, run this one example (it'll run 
>>> all the benchmarks across different sizes) and then run this script on 
>>> the output to make all the figures?
>> 
>> It is documented in 
>> https://www.overleaf.com/project/5ff8f7aca589b2f7eb81c579  You may 
>> need to dig through the submit scripts etc to find out exactly.
> 
> This runs a ton of small jobs and each job doesn't really preload, but 
> instead of loops in job submission scripts, the loops could be inside the 
> C code and it could directly output tabular data. This would run faster 
> and be easier to submit and analyze.
> 
> https://gitlab.com/hannah_mairs/summit-performance/-/blob/master/summit-submissions/submit_gpu1.lsf
> 
> It would hopefully also avoid writing the size range manually over here 
> in the analysis script where it has to match exactly the job submission.
> 
> https://gitlab.com/hannah_mairs/summit-performance/-/blob/master/python/graphs.py#L8-9
> 
> 
> We'd make our lives a lot easier understanding new machines if we put 
> into the design of performance studies just a fraction of the kind of 
> thought we put into public library interfaces.



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
Yeah, I'm referring to the operational aspect of data management, not benchmark 
design (which is hard and even Sam had years working with Mark and me on HPGMG 
to refine that).

If you run libCEED BPs (which use PETSc), you can run one command

srun -N ./bps -ceed /cpu/self/xsmm/blocked,/gpu/cuda/gen -degree 2,3,4,5 
-local_nodes 1000,500 -problem bp1,bp2,bp3,bp4

and it'll loop (in C code) over all the combinations (reusing some 
non-benchmarked things like the DMPlex) across the whole range of sizes, 
problems, devices. It makes one output file and you feed that to a Python 
script to read it as a Pandas DataFrame and plot (or read and interact in a 
notebook). You can have a basket of files from different machines and slice 
those plots without code changes.

We should do similar for a suite of PETSc benchmarks, even just basic Vec and 
Mat operations like in the reports. It isn't more work than a rickety bundle of 
scripts, and it's a lot less error-prone.
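
A minimal sketch of that shape for a PETSc micro-benchmark (the size sweep, repetition 
count, and CSV layout here are all made up for illustration; it is not an existing PETSc 
example):

```c
#include <petscvec.h>
#include <petsctime.h>

int main(int argc, char **argv)
{
  const PetscInt sizes[] = {1000, 10000, 100000, 1000000};
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = PetscPrintf(PETSC_COMM_WORLD, "op,local_size,seconds_per_call\n");CHKERRQ(ierr);
  for (int i = 0; i < 4; i++) {
    Vec            x, y;
    PetscReal      nrm;
    PetscLogDouble t0, t1;

    ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
    ierr = VecSetSizes(x, sizes[i], PETSC_DECIDE);CHKERRQ(ierr);
    ierr = VecSetFromOptions(x);CHKERRQ(ierr);       /* e.g. -vec_type kokkos */
    ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
    ierr = VecSet(x, 1.0);CHKERRQ(ierr);
    ierr = VecSet(y, 2.0);CHKERRQ(ierr);
    ierr = VecAXPY(y, 3.0, x);CHKERRQ(ierr);         /* warm-up / preload */
    ierr = PetscTime(&t0);CHKERRQ(ierr);
    for (int k = 0; k < 100; k++) {ierr = VecAXPY(y, 3.0, x);CHKERRQ(ierr);}
    ierr = VecNorm(y, NORM_2, &nrm);CHKERRQ(ierr);   /* forces asynchronous GPU work to finish */
    ierr = PetscTime(&t1);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "VecAXPY,%lld,%g\n", (long long)sizes[i], (double)((t1 - t0)/100.0));CHKERRQ(ierr);
    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = VecDestroy(&y);CHKERRQ(ierr);
  }
  ierr = PetscFinalize();
  return ierr;
}
```

One output file per machine can then be read into a DataFrame and sliced without touching 
the submission scripts.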

Barry Smith  writes:

>   I submit it is actually a good amount of additional work and requires real 
> creativity and very good judgment; it is not a good intro or undergrad 
> project, especially for someone without a huge amount of hands-on experience
> already. Look who had to do the new SpecHPC multigrid benchmark. The last
> time I checked Sam was not an undergrad: Senior Scientist, Lawrence Berkeley
> National Laboratory, cited by 11194. I definitely do not plan to involve
> myself in any brand new serious benchmarking studies in my current lifetime, 
> doing one correctly is a massive undertaking IMHO.
>
>> On Jan 22, 2022, at 6:43 PM, Jed Brown  wrote:
>> 
>> This isn't so much more or less work, but work in more useful places. Maybe 
>> this is a good undergrad or intro project to make a clean workflow for these 
>> experiments.
>> 
>> Barry Smith  writes:
>> 
>>>  Performance studies are enormously difficult to do well; which is why 
>>> there are so few good ones out there. And unless you fall into the LINPACK 
>>> benchmark or hit upon Streams the rewards of doing an excellent job are 
>>> pretty thin. Even Streams was not properly maintained for many years, you 
>>> could not just get it and use it out of the box for a variety of purposes 
>>> (which is why PETSc has its hacked-up ones). I submit a proper 
>>> performance study is a full-time job and everyone always has those.
>>> 
 On Jan 22, 2022, at 2:11 PM, Jed Brown  wrote:
 
 Barry Smith  writes:
 
>> On Jan 22, 2022, at 12:15 PM, Jed Brown  wrote:
>> Barry, when you did the tech reports, did you make an example to 
>> reproduce on other architectures? Like, run this one example (it'll run 
>> all the benchmarks across different sizes) and then run this script on 
>> the output to make all the figures?
> 
>  It is documented in 
> https://www.overleaf.com/project/5ff8f7aca589b2f7eb81c579 You may need 
> to dig through the submit scripts etc to find out exactly.
 
 This runs a ton of small jobs and each job doesn't really preload, but 
 instead of loops in job submission scripts, the loops could be inside the 
 C code and it could directly output tabular data. This would run faster 
 and be easier to submit and analyze.
 
 https://gitlab.com/hannah_mairs/summit-performance/-/blob/master/summit-submissions/submit_gpu1.lsf
 
 It would hopefully also avoid writing the size range manually over here in 
 the analysis script where it has to match exactly the job submission.
 
 https://gitlab.com/hannah_mairs/summit-performance/-/blob/master/python/graphs.py#L8-9
 
 
 We'd make our lives a lot easier understanding new machines if we put into 
 the design of performance studies just a fraction of the kind of thought 
 we put into public library interfaces.


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith

  I submit it is actually a good amount of additional work and requires real 
creativity and very good judgment; it is not a good intro or undergrad project, 
especially for someone without a huge amount of hands-on experience already. 
Look who had to do the new SpecHPC multigrid benchmark. The last time I checked, 
Sam was not an undergrad ("Senior Scientist, Lawrence Berkeley National 
Laboratory - Cited by 11194"). I definitely do not plan to involve myself in 
any brand new serious benchmarking studies in my current lifetime; doing one 
correctly is a massive undertaking IMHO.

> On Jan 22, 2022, at 6:43 PM, Jed Brown  wrote:
> 
> This isn't so much more or less work, but work in more useful places. Maybe 
> this is a good undergrad or intro project to make a clean workflow for these 
> experiments.
> 
> Barry Smith  writes:
> 
>>  Performance studies are enormously difficult to do well; which is why there 
>> are so few good ones out there. And unless you fall into the LINPACK 
>> benchmark or hit upon Streams the rewards of doing an excellent job are 
>> pretty thin. Even Streams was not properly maintained for many years, you 
>> could not just get it and use it out of the box for a variety of purposes 
>> (which is why PETSc has its hacked-up ones). I submit a proper performance 
>> study is a full-time job and everyone always has those.
>> 
>>> On Jan 22, 2022, at 2:11 PM, Jed Brown  wrote:
>>> 
>>> Barry Smith  writes:
>>> 
> On Jan 22, 2022, at 12:15 PM, Jed Brown  wrote:
> Barry, when you did the tech reports, did you make an example to 
> reproduce on other architectures? Like, run this one example (it'll run 
> all the benchmarks across different sizes) and then run this script on 
> the output to make all the figures?
 
  It is documented in 
 https://www.overleaf.com/project/5ff8f7aca589b2f7eb81c579 You may need 
 to dig through the submit scripts etc to find out exactly.
>>> 
>>> This runs a ton of small jobs and each job doesn't really preload, but 
>>> instead of loops in job submission scripts, the loops could be inside the C 
>>> code and it could directly output tabular data. This would run faster and 
>>> be easier to submit and analyze.
>>> 
>>> https://gitlab.com/hannah_mairs/summit-performance/-/blob/master/summit-submissions/submit_gpu1.lsf
>>> 
>>> It would hopefully also avoid writing the size range manually over here in 
>>> the analysis script where it has to match exactly the job submission.
>>> 
>>> https://gitlab.com/hannah_mairs/summit-performance/-/blob/master/python/graphs.py#L8-9
>>> 
>>> 
>>> We'd make our lives a lot easier understanding new machines if we put into 
>>> the design of performance studies just a fraction of the kind of thought we 
>>> put into public library interfaces.



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
This isn't so much more or less work, but work in more useful places. Maybe 
this is a good undergrad or intro project to make a clean workflow for these 
experiments.

Barry Smith  writes:

>   Performance studies are enormously difficult to do well; which is why there 
> are so few good ones out there. And unless you fall into the LINPACK 
> benchmark or hit upon Streams the rewards of doing an excellent job are 
> pretty thin. Even Streams was not properly maintained for many years, you 
> could not just get it and use it out of the box for a variety of purposes 
> (which is why PETSc has its hacked-up ones). I submit a proper performance 
> study is a full-time job and everyone always has those.
>
>> On Jan 22, 2022, at 2:11 PM, Jed Brown  wrote:
>> 
>> Barry Smith  writes:
>> 
 On Jan 22, 2022, at 12:15 PM, Jed Brown  wrote:
 Barry, when you did the tech reports, did you make an example to reproduce 
 on other architectures? Like, run this one example (it'll run all the 
 benchmarks across different sizes) and then run this script on the output 
 to make all the figures?
>>> 
>>>   It is documented in 
>>> https://www.overleaf.com/project/5ff8f7aca589b2f7eb81c579 You may need 
>>> to dig through the submit scripts etc to find out exactly.
>> 
>> This runs a ton of small jobs and each job doesn't really preload, but 
>> instead of loops in job submission scripts, the loops could be inside the C 
>> code and it could directly output tabular data. This would run faster and be 
>> easier to submit and analyze.
>> 
>> https://gitlab.com/hannah_mairs/summit-performance/-/blob/master/summit-submissions/submit_gpu1.lsf
>> 
>> It would hopefully also avoid writing the size range manually over here in 
>> the analysis script where it has to match exactly the job submission.
>> 
>> https://gitlab.com/hannah_mairs/summit-performance/-/blob/master/python/graphs.py#L8-9
>> 
>> 
>> We'd make our lives a lot easier understanding new machines if we put into 
>> the design of performance studies just a fraction of the kind of thought we 
>> put into public library interfaces.


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith

   I cleaned up Mark's last run and put it in a fixed-width font. I realize 
this may be too difficult but it would be great to have identical runs to 
compare with on Summit.


   As Jed noted, Scatter takes a long time, but the pack and unpack take no time? 
Is this not timed when using Kokkos?


--- Event Stage 2: KSP Solve only

MatMult  400 1.0 8.8003e+00 1.1 1.06e+11 1.0 2.2e+04 8.5e+04 
0.0e+00  2 55 61 54  0  70 91100100   95,058   132,242  0 0.00e+000 
0.00e+00 100
VecScatterBegin  400 1.0 1.3391e+00 2.6 0.00e+00 0.0 2.2e+04 8.5e+04 
0.0e+00  0  0 61 54  0   7  01001000 0  0 0.00e+000 
0.00e+00  0
VecScatterEnd400 1.0 1.3240e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   9  0  0  00 0  0 0.00e+000 
0.00e+00  0
SFPack   400 1.0 1.8276e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  00 0  0 0.00e+000 
0.00e+00  0
SFUnpack 400 1.0 6.2653e-05 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  00 0  0 0.00e+000 
0.00e+00  0

KSPSolve   2 1.0 1.2540e+01 1.0 1.17e+11 1.0 2.2e+04 8.5e+04 
1.2e+03  3 60 61 54 60 100100100  73,592   116,796  0 0.00e+000 
0.00e+00 100
VecTDot  802 1.0 1.3551e+00 1.2 3.36e+09 1.0 0.0e+00 0.0e+00 
8.0e+02  0  2  0  0 40  10  3  0  19,62752,599  0 0.00e+000 
0.00e+00 100
VecNorm  402 1.0 9.0151e-01 2.2 1.69e+09 1.0 0.0e+00 0.0e+00 
4.0e+02  0  1  0  0 20   5  1  0  0   14,788   125,477  0 0.00e+000 
0.00e+00 100
VecAXPY  800 1.0 8.2617e-01 1.0 3.36e+09 1.0 0.0e+00 0.0e+00 
0.0e+00  0  2  0  0  0   7  3  0  0   32,11261,644  0 0.00e+000 
0.00e+00 100
VecAYPX  398 1.0 8.1525e-01 1.6 1.67e+09 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   5  1  0  0   16,19020,689  0 0.00e+000 
0.00e+00 100
VecPointwiseMult 402 1.0 3.5694e-01 1.0 8.43e+08 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   3  1  0  0   18,67538,633  0 0.00e+000 
0.00e+00 100



> On Jan 22, 2022, at 12:40 PM, Mark Adams  wrote:
> 
> And I have a new MR if you want to see what I've done so far.



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith

  The GPU flop rate (when 100 percent of the flops are on the GPU) should always be 
higher than the overall flop rate (the previous column), since the GPU rate divides 
the same flops by only the kernel time rather than the whole event time. For large 
problems they should be similar; for small problems the GPU one may be much higher.

  If the CPU one is higher (when 100 percent of the flops are on the GPU), something 
must be wrong with the logging. I looked at the code for the two cases and didn't 
see anything obvious.

  Junchao and Jacob,
  I think some of the timing code in the Kokkos interface is wrong. 

*  The PetscLogGpuTimeBegin/End should be inside the viewer access code, not 
outside it. (The GPU time is an attempt to time the kernels as closely as possible, 
not the other processing around the use of the kernels; that other stuff is 
captured in the general LogEventBegin/End.)
*  The use of WaitForKokkos() is confusing and seems inconsistent. 
 - For example, it is used in VecTDot_SeqKokkos(), which I would think 
already has a barrier because it puts a scalar result into update? 
 - Plus, PetscLogGpuTimeBegin/End is supposed to already have a suitable 
mechanism (that Hong added) to ensure the kernel is complete; reading the manual 
page and looking at Jacob's cupmcontext.hpp it seems to be there, so I don't 
think WaitForKokkos() is needed in most places (or is Kokkos asynchronous and 
needs this for correctness?). 
But these won't explain the strange result of the overall flop rate being higher 
than the GPU flop rate.
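
Schematically, the ordering I mean is the sketch below. VecTDot_DeviceSketch is a 
made-up name and a plain host loop stands in for the Kokkos kernel so the sketch 
stays self-contained; the point is only where the begin/end calls sit relative to 
the array access and the kernel.

/* Sketch of the intended ordering for a device vector op; the host loop below
   merely stands in for the Kokkos dot kernel and its synchronization. */
#include <petscvec.h>

PetscErrorCode VecTDot_DeviceSketch(Vec x, Vec y, PetscScalar *z)
{
  PetscErrorCode     ierr;
  const PetscScalar *xarr, *yarr;
  PetscScalar        sum = 0.0;
  PetscInt           i, n;

  PetscFunctionBegin;
  ierr = VecGetLocalSize(x, &n);CHKERRQ(ierr);
  ierr = VecGetArrayRead(x, &xarr);CHKERRQ(ierr);   /* access/sync bookkeeping: not part of the GPU time */
  ierr = VecGetArrayRead(y, &yarr);CHKERRQ(ierr);
  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);      /* time only the kernel (and its completion) */
  for (i = 0; i < n; i++) sum += xarr[i]*yarr[i];   /* placeholder for the Kokkos dot kernel + sync */
  *z = sum;
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*n);CHKERRQ(ierr);     /* the operation reports its own flop count */
  ierr = VecRestoreArrayRead(x, &xarr);CHKERRQ(ierr);
  ierr = VecRestoreArrayRead(y, &yarr);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}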

  Barry





> On Jan 22, 2022, at 11:44 AM, Mark Adams  wrote:
> 
> I am getting some funny timings and I'm trying to figure it out.  
> I figure the GPU flop rates are a bit higher because the timers are inside of 
> the CPU timers, but some are a lot bigger or inverted 
> 
> --- Event Stage 2: KSP Solve only
> 
> MatMult  400 1.0 1.0094e+01 1.2 1.07e+11 1.0 3.7e+05 6.1e+04 
> 0.0e+00  2 55 62 54  0  68 91100100  0 671849   857147  0 0.00e+000 
> 0.00e+00 100
> MatView2 1.0 4.5257e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 
> 2.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> KSPSolve   2 1.0 1.4591e+01 1.1 1.18e+11 1.0 3.7e+05 6.1e+04 
> 1.2e+03  2 60 62 54 60 100100100100100 512399   804048  0 0.00e+000 
> 0.00e+00 100
> SFPack   400 1.0 2.4545e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> SFUnpack 400 1.0 9.4637e-05 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecTDot  802 1.0 3.0577e+00 2.1 3.36e+09 1.0 0.0e+00 0.0e+00 
> 8.0e+02  0  2  0  0 40  13  3  0  0 67 69996   488328  0 0.00e+000 
> 0.00e+00 100
> VecNorm  402 1.0 1.9597e+00 3.4 1.69e+09 1.0 0.0e+00 0.0e+00 
> 4.0e+02  0  1  0  0 20   6  1  0  0 33 54744   571507  0 0.00e+000 
> 0.00e+00 100
> VecCopy4 1.0 1.7143e-0228.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecSet 4 1.0 3.8051e-0316.9 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecAXPY  800 1.0 8.6160e-0113.6 3.36e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  2  0  0  0   6  3  0  0  0 247787   448304  0 0.00e+000 
> 0.00e+00 100
> VecAYPX  398 1.0 1.6831e+0031.1 1.67e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  1  0  0  0   5  1  0  0  0 63107   77030  0 0.00e+000 
> 0.00e+00 100
> VecPointwiseMult 402 1.0 3.8729e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   2  1  0  0  0 138502   262413  0 0.00e+000 
> 0.00e+00 100
> VecScatterBegin  400 1.0 1.1947e+0035.1 0.00e+00 0.0 3.7e+05 6.1e+04 
> 0.0e+00  0  0 62 54  0   5  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd400 1.0 6.2969e+00 8.8 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0  10  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> PCApply  402 1.0 3.8758e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   2  1  0  0  0 138396   262413  0 0.00e+000 
> 0.00e+00 100
> ---
> 
> 
> On Sat, Jan 22, 2022 at 11:10 AM Junchao Zhang  > wrote:
> 
> 
> 
> On Sat, Jan 22, 2022 at 10:04 AM Mark Adams  > wrote:
> Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End()  right?
> No, PetscLogGpuTime() does not know the flops of the caller.
>  
> 
> On Fri, Jan 21, 2022 at 9:47 PM Barry Smith  > wrote:
> 
>   Mark,
> 
>   Fix the logging before you run more. It will help with seeing the true 
> disparity between the MatMult and the vector ops.

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith


  Performance studies are enormously difficult to do well; which is why there 
are so few good ones out there. And unless you fall into the LINPACK benchmark 
or hit upon Streams the rewards of doing an excellent job are pretty thin. Even 
Streams was not properly maintained for many years, you could not just get it 
and use it out of the box for a variety of purposes (which is why PETSc has its 
hacked-up ones). I submit a proper performance study is a full-time job and 
everyone always has those.

> On Jan 22, 2022, at 2:11 PM, Jed Brown  wrote:
> 
> Barry Smith  writes:
> 
>>> On Jan 22, 2022, at 12:15 PM, Jed Brown  wrote:
>>> Barry, when you did the tech reports, did you make an example to reproduce 
>>> on other architectures? Like, run this one example (it'll run all the 
>>> benchmarks across different sizes) and then run this script on the output 
>>> to make all the figures?
>> 
>>   It is documented in 
>> https://www.overleaf.com/project/5ff8f7aca589b2f7eb81c579 You may need to 
>> dig through the submit scripts etc to find out exactly.
> 
> This runs a ton of small jobs and each job doesn't really preload, but 
> instead of loops in job submission scripts, the loops could be inside the C 
> code and it could directly output tabular data. This would run faster and be 
> easier to submit and analyze.
> 
> https://gitlab.com/hannah_mairs/summit-performance/-/blob/master/summit-submissions/submit_gpu1.lsf
> 
> It would hopefully also avoid writing the size range manually over here in 
> the analysis script where it has to match exactly the job submission.
> 
> https://gitlab.com/hannah_mairs/summit-performance/-/blob/master/python/graphs.py#L8-9
> 
> 
> We'd make our lives a lot easier understanding new machines if we put into 
> the design of performance studies just a fraction of the kind of thought we 
> put into public library interfaces.



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jacob Faibussowitsch
> I suggested years ago that -log_view automatically print useful information 
> about the GPU setup (when GPUs are used) but everyone seemed comfortable with 
> the lack of information so no one improved it.

FWIW, PetscDeviceView() does a bit of what you want (it just dumps all of 
cuda/hipDeviceProp)

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

> On Jan 22, 2022, at 12:55, Barry Smith  wrote:
> 
> 
>  I suggested years ago that -log_view automatically print useful information 
> about the GPU setup (when GPUs are used) but everyone seemed comfortable with 
> the lack of information so no one improved it. I think for a small number of 
> GPUs -log_view should just print details and for a larger number print some 
> statistics (how many physical ones etc). Currently, it does not even print 
> how many are used. I think requiring another option to get this basic 
> information is a mistake; we already print a ton of background with -log_view, 
> and it is just sad that there is no background on the GPU usage.
> 
> 
> 
> 
> 
>> On Jan 22, 2022, at 1:06 PM, Jed Brown  wrote:
>> 
>> Mark Adams  writes:
>> 
>>> On Sat, Jan 22, 2022 at 12:29 PM Jed Brown  wrote:
>>> 
 Mark Adams  writes:
 
>> 
>> 
>> 
>>> VecPointwiseMult 402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00
 0.0e+00
>> 0.0e+00  0  0  0  0  0   5  1  0  0  0 22515   70608  0 0.00e+00
 0
>> 0.00e+00 100
>>> VecScatterBegin  400 1.0 1.6791e-01 6.0 0.00e+00 0.0 3.7e+05
 1.6e+04
>> 0.0e+00  0  0 62 54  0   2  0100100  0 0   0  0 0.00e+00
 0
>> 0.00e+00  0
>>> VecScatterEnd400 1.0 1.0057e+00 7.0 0.00e+00 0.0 0.0e+00
 0.0e+00
>> 0.0e+00  0  0  0  0  0   5  0  0  0  0 0   0  0 0.00e+00
 0
>> 0.00e+00  0
>>> PCApply  402 1.0 2.9638e-01 3.6 1.05e+08 1.0 0.0e+00
 0.0e+00
>> 0.0e+00  0  0  0  0  0   5  1  0  0  0 22490   70608  0 0.00e+00
 0
>> 0.00e+00 100
>> 
>> Most of the MatMult time is attributed to VecScatterEnd here. Can you
>> share a run of the same total problem size on 8 ranks (one rank per
 GPU)?
>> 
>> 
> attached. I ran out of memory with the same size problem so this is the
> 262K / GPU version.
 
 How was this launched? Is it possible all 8 ranks were using the same GPU?
 (Perf is that bad.)
 
>>> 
>>> srun -n8 -N1 *--ntasks-per-gpu=1* --gpu-bind=closest ../ex13
>>> -dm_plex_box_faces 2,2,2 -petscpartitioner_simple_process_grid 2,2,2
>>> -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1
>>> -dm_refine 6 -dm_view -dm_mat_type aijkokkos -dm_vec_type kokkos -pc_type
>>> jacobi -log_view -ksp_view -use_gpu_aware_mpi true
>> 
>> I'm still worried because the results are so unreasonable. We should add an 
>> option like -view_gpu_busid that prints this information per rank.
>> 
>> https://code.ornl.gov/olcf/hello_jobstep/-/blob/master/hello_jobstep.cpp
>> 
>> A single-process/single-GPU comparison would also be a useful point of 
>> comparison.
> 



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
We could create a communicator for the MPI ranks in the first shared-memory 
node, then enumerate their mapping (NUMA and core affinity, and what GPUs they 
see).
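
A sketch along those lines (not an existing PETSc or Slurm utility; it assumes a 
HIP machine and would be built with hipcc or a HIP-aware MPI compiler wrapper, in 
the spirit of OLCF's hello_jobstep):

/* Split a per-node communicator and have every rank report its core and the PCI
   bus ID of the GPU it ended up with.  Error checking omitted for brevity. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>
#include <hip/hip_runtime_api.h>

int main(int argc, char **argv)
{
  MPI_Comm nodecomm;
  int      wrank, nrank, dev = -1;
  char     busid[64] = "unknown";

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
  /* all ranks that share a memory domain (i.e. one node) land in nodecomm */
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodecomm);
  MPI_Comm_rank(nodecomm, &nrank);
  hipGetDevice(&dev);
  hipDeviceGetPCIBusId(busid, (int)sizeof(busid), dev);
  printf("world rank %d  node rank %d  core %d  GPU %d  busid %s\n",
         wrank, nrank, sched_getcpu(), dev, busid);
  MPI_Comm_free(&nodecomm);
  MPI_Finalize();
  return 0;
}

Running one copy per rank under srun immediately shows whether several ranks 
ended up on the same bus ID.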

Barry Smith  writes:

>   I suggested years ago that -log_view automatically print useful information 
> about the GPU setup (when GPUs are used) but everyone seemed comfortable with 
> the lack of information so no one improved it. I think for a small number of 
> GPUs -log_view should just print details and for a larger number print some 
> statistics (how many physical ones etc). Currently, it does not even print 
> how many are used. I think requiring another option to get this basic 
> information is a mistake; we already print a ton of background with -log_view, 
> and it is just sad that there is no background on the GPU usage.


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
Barry Smith  writes:

>> On Jan 22, 2022, at 12:15 PM, Jed Brown  wrote:
>> Barry, when you did the tech reports, did you make an example to reproduce 
>> on other architectures? Like, run this one example (it'll run all the 
>> benchmarks across different sizes) and then run this script on the output to 
>> make all the figures?
>
>It is documented in 
> https://www.overleaf.com/project/5ff8f7aca589b2f7eb81c579 You may need to 
> dig through the submit scripts etc to find out exactly.

This runs a ton of small jobs and each job doesn't really preload, but instead 
of loops in job submission scripts, the loops could be inside the C code and it 
could directly output tabular data. This would run faster and be easier to 
submit and analyze.

https://gitlab.com/hannah_mairs/summit-performance/-/blob/master/summit-submissions/submit_gpu1.lsf

It would hopefully also avoid writing the size range manually over here in the 
analysis script where it has to match exactly the job submission.

https://gitlab.com/hannah_mairs/summit-performance/-/blob/master/python/graphs.py#L8-9


We'd make our lives a lot easier understanding new machines if we put into the 
design of performance studies just a fraction of the kind of thought we put 
into public library interfaces.


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith


  I suggested years ago that -log_view automatically print useful information 
about the GPU setup (when GPUs are used) but everyone seemed comfortable with 
the lack of information so no one improved it. I think for a small number of 
GPUs -log_view should just print details and for a larger number print some 
statistics (how many physical ones etc). Currently, it does not even print how 
many are used. I think requiring another option to get this basic information 
is a mistake; we already print a ton of background with -log_view, and it is just 
sad that there is no background on the GPU usage.





> On Jan 22, 2022, at 1:06 PM, Jed Brown  wrote:
> 
> Mark Adams  writes:
> 
>> On Sat, Jan 22, 2022 at 12:29 PM Jed Brown  wrote:
>> 
>>> Mark Adams  writes:
>>> 
> 
> 
> 
>> VecPointwiseMult 402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00
>>> 0.0e+00
> 0.0e+00  0  0  0  0  0   5  1  0  0  0 22515   70608  0 0.00e+00
>>> 0
> 0.00e+00 100
>> VecScatterBegin  400 1.0 1.6791e-01 6.0 0.00e+00 0.0 3.7e+05
>>> 1.6e+04
> 0.0e+00  0  0 62 54  0   2  0100100  0 0   0  0 0.00e+00
>>> 0
> 0.00e+00  0
>> VecScatterEnd400 1.0 1.0057e+00 7.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00
> 0.0e+00  0  0  0  0  0   5  0  0  0  0 0   0  0 0.00e+00
>>> 0
> 0.00e+00  0
>> PCApply  402 1.0 2.9638e-01 3.6 1.05e+08 1.0 0.0e+00
>>> 0.0e+00
> 0.0e+00  0  0  0  0  0   5  1  0  0  0 22490   70608  0 0.00e+00
>>> 0
> 0.00e+00 100
> 
> Most of the MatMult time is attributed to VecScatterEnd here. Can you
> share a run of the same total problem size on 8 ranks (one rank per
>>> GPU)?
> 
> 
 attached. I ran out of memory with the same size problem so this is the
 262K / GPU version.
>>> 
>>> How was this launched? Is it possible all 8 ranks were using the same GPU?
>>> (Perf is that bad.)
>>> 
>> 
>> srun -n8 -N1 *--ntasks-per-gpu=1* --gpu-bind=closest ../ex13
>> -dm_plex_box_faces 2,2,2 -petscpartitioner_simple_process_grid 2,2,2
>> -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1
>> -dm_refine 6 -dm_view -dm_mat_type aijkokkos -dm_vec_type kokkos -pc_type
>> jacobi -log_view -ksp_view -use_gpu_aware_mpi true
> 
> I'm still worried because the results are so unreasonable. We should add an 
> option like -view_gpu_busid that prints this information per rank.
> 
> https://code.ornl.gov/olcf/hello_jobstep/-/blob/master/hello_jobstep.cpp
> 
> A single-process/single-GPU comparison would also be a useful point of 
> comparison.



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith



> On Jan 22, 2022, at 12:15 PM, Jed Brown  wrote:
> 
> Mark Adams  writes:
> 
>> as far as streams, does it know to run on the GPU? You don't specify
>> something like -G 1 here for GPUs. I think you just get them all.
> 
> No, this isn't GPU code. BabelStream is a common STREAM suite for different 
> programming models, though I think it doesn't support MPI with GPUs and thus 
> isn't really useful. The code is pretty vanilla. 
> 
> https://github.com/UoB-HPC/BabelStream
> 
> It's very similar to "nstream" in Jeff's PRK
> 
> https://github.com/ParRes/Kernels
> 
> Code is vanilla so I'd expect the results to be much like the corresponding 
> Vec operations.
> 
> Barry, when you did the tech reports, did you make an example to reproduce on 
> other architectures? Like, run this one example (it'll run all the benchmarks 
> across different sizes) and then run this script on the output to make all 
> the figures?

   It is documented in 
https://www.overleaf.com/project/5ff8f7aca589b2f7eb81c579 You may need to 
dig through the submit scripts etc to find out exactly.







Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
Mark Adams  writes:

> On Sat, Jan 22, 2022 at 12:29 PM Jed Brown  wrote:
>
>> Mark Adams  writes:
>>
>> >>
>> >>
>> >>
>> >> > VecPointwiseMult 402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00
>> 0.0e+00
>> >> 0.0e+00  0  0  0  0  0   5  1  0  0  0 22515   70608  0 0.00e+00
>> 0
>> >> 0.00e+00 100
>> >> > VecScatterBegin  400 1.0 1.6791e-01 6.0 0.00e+00 0.0 3.7e+05
>> 1.6e+04
>> >> 0.0e+00  0  0 62 54  0   2  0100100  0 0   0  0 0.00e+00
>> 0
>> >> 0.00e+00  0
>> >> > VecScatterEnd400 1.0 1.0057e+00 7.0 0.00e+00 0.0 0.0e+00
>> 0.0e+00
>> >> 0.0e+00  0  0  0  0  0   5  0  0  0  0 0   0  0 0.00e+00
>> 0
>> >> 0.00e+00  0
>> >> > PCApply  402 1.0 2.9638e-01 3.6 1.05e+08 1.0 0.0e+00
>> 0.0e+00
>> >> 0.0e+00  0  0  0  0  0   5  1  0  0  0 22490   70608  0 0.00e+00
>> 0
>> >> 0.00e+00 100
>> >>
>> >> Most of the MatMult time is attributed to VecScatterEnd here. Can you
>> >> share a run of the same total problem size on 8 ranks (one rank per
>> GPU)?
>> >>
>> >>
>> > attached. I ran out of memory with the same size problem so this is the
>> > 262K / GPU version.
>>
>> How was this launched? Is it possible all 8 ranks were using the same GPU?
>> (Perf is that bad.)
>>
>
> srun -n8 -N1 *--ntasks-per-gpu=1* --gpu-bind=closest ../ex13
> -dm_plex_box_faces 2,2,2 -petscpartitioner_simple_process_grid 2,2,2
> -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1
> -dm_refine 6 -dm_view -dm_mat_type aijkokkos -dm_vec_type kokkos -pc_type
> jacobi -log_view -ksp_view -use_gpu_aware_mpi true

I'm still worried because the results are so unreasonable. We should add an 
option like -view_gpu_busid that prints this information per rank.

https://code.ornl.gov/olcf/hello_jobstep/-/blob/master/hello_jobstep.cpp

A single-process/single-GPU comparison would also be a useful point of 
comparison.


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
And I have a new MR if you want to see what I've done so far.


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
On Sat, Jan 22, 2022 at 12:29 PM Jed Brown  wrote:

> Mark Adams  writes:
>
> >>
> >>
> >>
> >> > VecPointwiseMult 402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00
> 0.0e+00
> >> 0.0e+00  0  0  0  0  0   5  1  0  0  0 22515   70608  0 0.00e+00
> 0
> >> 0.00e+00 100
> >> > VecScatterBegin  400 1.0 1.6791e-01 6.0 0.00e+00 0.0 3.7e+05
> 1.6e+04
> >> 0.0e+00  0  0 62 54  0   2  0100100  0 0   0  0 0.00e+00
> 0
> >> 0.00e+00  0
> >> > VecScatterEnd400 1.0 1.0057e+00 7.0 0.00e+00 0.0 0.0e+00
> 0.0e+00
> >> 0.0e+00  0  0  0  0  0   5  0  0  0  0 0   0  0 0.00e+00
> 0
> >> 0.00e+00  0
> >> > PCApply  402 1.0 2.9638e-01 3.6 1.05e+08 1.0 0.0e+00
> 0.0e+00
> >> 0.0e+00  0  0  0  0  0   5  1  0  0  0 22490   70608  0 0.00e+00
> 0
> >> 0.00e+00 100
> >>
> >> Most of the MatMult time is attributed to VecScatterEnd here. Can you
> >> share a run of the same total problem size on 8 ranks (one rank per
> GPU)?
> >>
> >>
> > attached. I ran out of memory with the same size problem so this is the
> > 262K / GPU version.
>
> How was this launched? Is it possible all 8 ranks were using the same GPU?
> (Perf is that bad.)
>

srun -n8 -N1 *--ntasks-per-gpu=1* --gpu-bind=closest ../ex13
-dm_plex_box_faces 2,2,2 -petscpartitioner_simple_process_grid 2,2,2
-dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1
-dm_refine 6 -dm_view -dm_mat_type aijkokkos -dm_vec_type kokkos -pc_type
jacobi -log_view -ksp_view -use_gpu_aware_mpi true

+ a large .petscrc file


> >> From the other log file (10x bigger problem)
> >>
> >>
> > 
>
> You had attached two files and the difference seemed to be that the second
> was 10x more dofs/rank.
>

I am refining a cube so it goes by 8x.

jac_out_001_kokkos_Crusher_6_1_notpl.txt

number of nodes
number of refinements
number of processes per GPU


> > --- Event Stage 2: KSP Solve only
> >
> > MatMult  400 1.0 8.8003e+00 1.1 1.06e+11 1.0 2.2e+04 8.5e+04
> 0.0e+00  2 55 61 54  0  70 91100100  0 95058   132242  0 0.00e+000
> 0.00e+00 100
> > MatView2 1.0 1.1643e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> > KSPSolve   2 1.0 1.2540e+01 1.0 1.17e+11 1.0 2.2e+04 8.5e+04
> 1.2e+03  3 60 61 54 60 100100100100100 73592   116796  0 0.00e+000
> 0.00e+00 100
> > SFPack   400 1.0 1.8276e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> > SFUnpack 400 1.0 6.2653e-05 1.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> > VecTDot  802 1.0 1.3551e+00 1.2 3.36e+09 1.0 0.0e+00 0.0e+00
> 8.0e+02  0  2  0  0 40  10  3  0  0 67 19627   52599  0 0.00e+000
> 0.00e+00 100
> > VecNorm  402 1.0 9.0151e-01 2.2 1.69e+09 1.0 0.0e+00 0.0e+00
> 4.0e+02  0  1  0  0 20   5  1  0  0 33 14788   125477  0 0.00e+000
> 0.00e+00 100
> > VecCopy4 1.0 7.3905e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> > VecSet 4 1.0 3.1814e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> > VecAXPY  800 1.0 8.2617e-01 1.0 3.36e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  2  0  0  0   7  3  0  0  0 32112   61644  0 0.00e+000
> 0.00e+00 100
> > VecAYPX  398 1.0 8.1525e-01 1.6 1.67e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  1  0  0  0   5  1  0  0  0 16190   20689  0 0.00e+000
> 0.00e+00 100
> > VecPointwiseMult 402 1.0 3.5694e-01 1.0 8.43e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   3  1  0  0  0 18675   38633  0 0.00e+000
> 0.00e+00 100
> > VecScatterBegin  400 1.0 1.3391e+00 2.6 0.00e+00 0.0 2.2e+04 8.5e+04
> 0.0e+00  0  0 61 54  0   7  0100100  0 0   0  0 0.00e+000
> 0.00e+00  0
> > VecScatterEnd400 1.0 1.3240e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   9  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> > PCApply  402 1.0 3.5712e-01 1.0 8.43e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   3  1  0  0  0 18665   38633  0 0.00e+000
> 0.00e+00 100
>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
So where are we as far as timers?
See the latest examples (with 160 CHARACTERS).
Jed, "(I don't trust these timings)": what do you think?

No sense in doing an MR if it is still nonsense.

On Sat, Jan 22, 2022 at 12:16 PM Jed Brown  wrote:

> Mark Adams  writes:
>
> > as far as streams, does it know to run on the GPU? You don't specify
> > something like -G 1 here for GPUs. I think you just get them all.
>
> No, this isn't GPU code. BabelStream is a common STREAM suite for
> different programming models, though I think it doesn't support MPI with
> GPUs and thus isn't really useful. The code is pretty vanilla.
>
> https://github.com/UoB-HPC/BabelStream
>
> It's very similar to "nstream" in Jeff's PRK
>
> https://github.com/ParRes/Kernels
>
> Code is vanilla so I'd expect the results to be much like the
> corresponding Vec operations.
>
> Barry, when you did the tech reports, did you make an example to reproduce
> on other architectures? Like, run this one example (it'll run all the
> benchmarks across different sizes) and then run this script on the output
> to make all the figures?
>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
Mark Adams  writes:

>>
>>
>>
>> > VecPointwiseMult 402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   5  1  0  0  0 22515   70608  0 0.00e+000
>> 0.00e+00 100
>> > VecScatterBegin  400 1.0 1.6791e-01 6.0 0.00e+00 0.0 3.7e+05 1.6e+04
>> 0.0e+00  0  0 62 54  0   2  0100100  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> > VecScatterEnd400 1.0 1.0057e+00 7.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   5  0  0  0  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> > PCApply  402 1.0 2.9638e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   5  1  0  0  0 22490   70608  0 0.00e+000
>> 0.00e+00 100
>>
>> Most of the MatMult time is attributed to VecScatterEnd here. Can you
>> share a run of the same total problem size on 8 ranks (one rank per GPU)?
>>
>>
> attached. I ran out of memory with the same size problem so this is the
> 262K / GPU version.

How was this launched? Is it possible all 8 ranks were using the same GPU? 
(Perf is that bad.)

>> From the other log file (10x bigger problem)
>>
>>
> 

You had attached two files and the difference seemed to be that the second was 
10x more dofs/rank.

> --- Event Stage 2: KSP Solve only
>
> MatMult  400 1.0 8.8003e+00 1.1 1.06e+11 1.0 2.2e+04 8.5e+04 
> 0.0e+00  2 55 61 54  0  70 91100100  0 95058   132242  0 0.00e+000 
> 0.00e+00 100
> MatView2 1.0 1.1643e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 
> 2.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> KSPSolve   2 1.0 1.2540e+01 1.0 1.17e+11 1.0 2.2e+04 8.5e+04 
> 1.2e+03  3 60 61 54 60 100100100100100 73592   116796  0 0.00e+000 
> 0.00e+00 100
> SFPack   400 1.0 1.8276e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> SFUnpack 400 1.0 6.2653e-05 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecTDot  802 1.0 1.3551e+00 1.2 3.36e+09 1.0 0.0e+00 0.0e+00 
> 8.0e+02  0  2  0  0 40  10  3  0  0 67 19627   52599  0 0.00e+000 
> 0.00e+00 100
> VecNorm  402 1.0 9.0151e-01 2.2 1.69e+09 1.0 0.0e+00 0.0e+00 
> 4.0e+02  0  1  0  0 20   5  1  0  0 33 14788   125477  0 0.00e+000 
> 0.00e+00 100
> VecCopy4 1.0 7.3905e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecSet 4 1.0 3.1814e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecAXPY  800 1.0 8.2617e-01 1.0 3.36e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  2  0  0  0   7  3  0  0  0 32112   61644  0 0.00e+000 
> 0.00e+00 100
> VecAYPX  398 1.0 8.1525e-01 1.6 1.67e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  1  0  0  0   5  1  0  0  0 16190   20689  0 0.00e+000 
> 0.00e+00 100
> VecPointwiseMult 402 1.0 3.5694e-01 1.0 8.43e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   3  1  0  0  0 18675   38633  0 0.00e+000 
> 0.00e+00 100
> VecScatterBegin  400 1.0 1.3391e+00 2.6 0.00e+00 0.0 2.2e+04 8.5e+04 
> 0.0e+00  0  0 61 54  0   7  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd400 1.0 1.3240e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   9  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> PCApply  402 1.0 3.5712e-01 1.0 8.43e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   3  1  0  0  0 18665   38633  0 0.00e+000 
> 0.00e+00 100


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
>
>
>
> > VecPointwiseMult 402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   5  1  0  0  0 22515   70608  0 0.00e+000
> 0.00e+00 100
> > VecScatterBegin  400 1.0 1.6791e-01 6.0 0.00e+00 0.0 3.7e+05 1.6e+04
> 0.0e+00  0  0 62 54  0   2  0100100  0 0   0  0 0.00e+000
> 0.00e+00  0
> > VecScatterEnd400 1.0 1.0057e+00 7.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   5  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
> > PCApply  402 1.0 2.9638e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   5  1  0  0  0 22490   70608  0 0.00e+000
> 0.00e+00 100
>
> Most of the MatMult time is attributed to VecScatterEnd here. Can you
> share a run of the same total problem size on 8 ranks (one rank per GPU)?
>
>
attached. I ran out of memory with the same size problem so this is the
262K / GPU version.


> From the other log file (10x bigger problem)
>
>

DM Object: box 8 MPI processes
  type: plex
box in 3 dimensions:
  Number of 0-cells per rank: 274625 274625 274625 274625 274625 274625 274625 
274625
  Number of 1-cells per rank: 811200 811200 811200 811200 811200 811200 811200 
811200
  Number of 2-cells per rank: 798720 798720 798720 798720 798720 798720 798720 
798720
  Number of 3-cells per rank: 262144 262144 262144 262144 262144 262144 262144 
262144
Labels:
  celltype: 4 strata with value/size (0 (274625), 1 (811200), 4 (798720), 7 
(262144))
  depth: 4 strata with value/size (0 (274625), 1 (811200), 2 (798720), 3 
(262144))
  marker: 1 strata with value/size (1 (49530))
  Face Sets: 3 strata with value/size (1 (16129), 3 (16129), 6 (16129))
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=16581375, cols=16581375
total: nonzeros=1045678375, allocated nonzeros=1045678375
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=16581375, cols=16581375
total: nonzeros=1045678375, allocated nonzeros=1045678375
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 8 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 8 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 8 MPI processes
type: mpiaijkokkos
rows=16581375, cols=16581375
total: nonzeros=1045678375, allocated nonzeros=1045678375
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines

*** WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r 
-fCourier9' to print this document***


-- PETSc Performance Summary: 
--

/gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a 
arch-olcf-crusher named crusher003 with 8 processors, by adams Sat Jan 22 
12:15:11 2022
Using Petsc Development GIT revision: v3.16.3-682-g5f40ebe68c  GIT Date: 
2022-01-22 09:12:56 -0500

 Max   Max/Min Avg   Total
Time (sec):   3.812e+02 1.000   3.812e+02
Objects:  1.990e+03 1.027   1.947e+03
Flop: 1.940e+11 1.027   1.915e+11  1.532e+12
Flop/sec: 5.088e+08 1.027   5.022e+08  4.018e+09
MPI Messages: 4.806e+03 1.066   4.571e+03  3.657e+04
MPI Message Lengths:  4.434e+08 1.015   9.611e+04  3.515e+09
MPI Reductions:   1.991e+03 1.000

Flop counting convention: 1 flop = 1 real number operation of type 
(multiply/divide/add/subtract)
  

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
Mark Adams  writes:

> as far as streams, does it know to run on the GPU? You don't specify
> something like -G 1 here for GPUs. I think you just get them all.

No, this isn't GPU code. BabelStream is a common STREAM suite for different 
programming models, though I think it doesn't support MPI with GPUs and thus 
isn't really useful. The code is pretty vanilla. 

https://github.com/UoB-HPC/BabelStream

It's very similar to "nstream" in Jeff's PRK

https://github.com/ParRes/Kernels

Code is vanilla so I'd expect the results to be much like the corresponding Vec 
operations.
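
For reference, the kernel these suites time is essentially the triad loop below 
(a standalone sketch, not BabelStream's or the PRK's actual source); 
bandwidth-wise it is the same two-loads-one-store pattern as VecAXPY/VecAYPX.

/* Minimal STREAM-style triad: a[i] = b[i] + scalar*c[i], timed on the host. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
  const size_t n = 1u << 26;                       /* 64M doubles per array, ~512 MB each */
  const double scalar = 3.0;
  double *a = malloc(n * sizeof *a);
  double *b = malloc(n * sizeof *b);
  double *c = malloc(n * sizeof *c);
  struct timespec t0, t1;

  if (!a || !b || !c) return 1;
  for (size_t i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (size_t i = 0; i < n; i++) a[i] = b[i] + scalar * c[i];   /* triad */
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double sec    = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
  double gbytes = 3.0 * n * sizeof(double) / 1e9;  /* two loads + one store per element */
  printf("triad: %.3f s  %.1f GB/s  (check a[n/2] = %g)\n", sec, gbytes / sec, a[n / 2]);
  free(a); free(b); free(c);
  return 0;
}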

Barry, when you did the tech reports, did you make an example to reproduce on 
other architectures? Like, run this one example (it'll run all the benchmarks 
across different sizes) and then run this script on the output to make all the 
figures?


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
I am getting some funny timings and I'm trying to figure it out.
I figure the GPU flop rates are a bit higher because the timers are inside of
the CPU timers, but *some are a lot bigger or inverted*

--- Event Stage 2: KSP Solve only

MatMult  400 1.0 1.0094e+01 1.2 1.07e+11 1.0 3.7e+05 6.1e+04
0.0e+00  2 55 62 54  0  68 91100100  0 671849   857147  0 0.00e+000
0.00e+00 100
MatView2 1.0 4.5257e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00
2.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
0.00e+00  0
KSPSolve   2 1.0 1.4591e+01 1.1 1.18e+11 1.0 3.7e+05 6.1e+04
1.2e+03  2 60 62 54 60 100100100100100 512399   804048  0 0.00e+000
0.00e+00 100
SFPack   400 1.0 2.4545e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
0.00e+00  0
SFUnpack 400 1.0 9.4637e-05 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
0.00e+00  0
VecTDot  802 1.0 3.0577e+00 2.1 3.36e+09 1.0 0.0e+00 0.0e+00
8.0e+02  0  2  0  0 40  13  3  0  0 67 *69996   488328*  0 0.00e+00
 0 0.00e+00 100
VecNorm  402 1.0 1.9597e+00 3.4 1.69e+09 1.0 0.0e+00 0.0e+00
4.0e+02  0  1  0  0 20   6  1  0  0 33 54744   571507  0 0.00e+000
0.00e+00 100
VecCopy4 1.0 1.7143e-0228.6 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
0.00e+00  0
VecSet 4 1.0 3.8051e-0316.9 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
0.00e+00  0
VecAXPY  800 1.0 8.6160e-0113.6 3.36e+09 1.0 0.0e+00 0.0e+00
0.0e+00  0  2  0  0  0   6  3  0  0  0 *247787   448304*  0 0.00e+00
 0 0.00e+00 100
VecAYPX  398 1.0 1.6831e+0031.1 1.67e+09 1.0 0.0e+00 0.0e+00
0.0e+00  0  1  0  0  0   5  1  0  0  0 63107   77030  0 0.00e+000
0.00e+00 100
VecPointwiseMult 402 1.0 3.8729e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   2  1  0  0  0 138502   262413  0 0.00e+000
0.00e+00 100
VecScatterBegin  400 1.0 1.1947e+0035.1 0.00e+00 0.0 3.7e+05 6.1e+04
0.0e+00  0  0 62 54  0   5  0100100  0 0   0  0 0.00e+000
0.00e+00  0
VecScatterEnd400 1.0 6.2969e+00 8.8 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0  10  0  0  0  0 0   0  0 0.00e+000
0.00e+00  0
PCApply  402 1.0 3.8758e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   2  1  0  0  0 138396   262413  0 0.00e+000
0.00e+00 100
---


On Sat, Jan 22, 2022 at 11:10 AM Junchao Zhang 
wrote:

>
>
>
> On Sat, Jan 22, 2022 at 10:04 AM Mark Adams  wrote:
>
>> Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End()  right?
>>
> No, PetscLogGpuTime() does not know the flops of the caller.
>
>
>>
>> On Fri, Jan 21, 2022 at 9:47 PM Barry Smith  wrote:
>>
>>>
>>>   Mark,
>>>
>>>   Fix the logging before you run more. It will help with seeing the true
>>> disparity between the MatMult and the vector ops.
>>>
>>>
>>> On Jan 21, 2022, at 9:37 PM, Mark Adams  wrote:
>>>
>>> Here is one with 2M / GPU. Getting better.
>>>
>>> On Fri, Jan 21, 2022 at 9:17 PM Barry Smith  wrote:
>>>

Matt is correct, vectors are way too small.

BTW: Now would be a good time to run some of the Report I benchmarks
 on Crusher to get a feel for the kernel launch times and performance on
 VecOps.

Also Report 2.

   Barry


 On Jan 21, 2022, at 7:58 PM, Matthew Knepley  wrote:

 On Fri, Jan 21, 2022 at 6:41 PM Mark Adams  wrote:

> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian
> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
> MI200?).
> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware
> MPI are similar (mat-vec is a little faster w/o, the total is about the
> same, call it noise)
>
> I found that MatMult was about 3x faster using 8 cores/GPU, that is
> all 64 cores on the node, than when using 1 core/GPU. With the same size
> problem of course.
> I was thinking MatMult should be faster with just one MPI process. Oh
> well, worry about that later.
>
> The bigger problem, and I have observed this to some extent with the
> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations 
> are
> expensive or crazy expensive.
> You can see (attached) and the times here that the solve is dominated
> by not-mat-vec:
>
>
> 
> EventCount  Time (sec)

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
On Sat, Jan 22, 2022 at 10:25 AM Jed Brown  wrote:

> Mark Adams  writes:
>
> > On Fri, Jan 21, 2022 at 9:55 PM Barry Smith  wrote:
> >
> >>
> >> Interesting, Is this with all native Kokkos kernels or do some kokkos
> >> kernels use rocm?
> >>
> >
> > Ah, good question. I often run with tpl=0 but I did not specify here on
> > Crusher. In looking at the log files I see
> >
> -I/gpfs/alpine/csc314/scratch/adams/petsc/arch-olcf-crusher/externalpackages/git.kokkos-kernels/src/impl/tpls
> >
> > Here is a run with tpls turned off. These tpl includes are gone.
> >
> > It looks pretty much the same. A little slower but that could be noise.
>
> >
> 
> > *** WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r
> -fCourier9' to print this document***
> >
> 
>
> We gotta say 160 chars because that's what we use now.
>
>
done

as far as streams, does it know to run on the GPU? You don't specify
something like -G 1 here for GPUs. I think you just get them all.


11:14 adams/aijkokkos-gpu-logging=
crusher:/gpfs/alpine/csc314/scratch/adams/petsc$ make
PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc
PETSC_ARCH=arch-olcf-crusher streams
cc -o MPIVersion.o -c -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing
-Wno-unknown-pragmas -fstack-protector -Qunused-arguments
-fvisibility=hidden -g -O3
 -I/gpfs/alpine/csc314/scratch/adams/petsc/include
-I/gpfs/alpine/csc314/scratch/adams/petsc/arch-olcf-crusher/include
-I/opt/rocm-4.5.0/include`pwd`/MPIVersion.c
Running streams with '/usr/bin/srun -p batch -N 1 -A csc314_crusher -t
00:10:00 ' using 'NPMAX=128'
1  53355.9207   Rate (MB/s)
2  39565.2208   Rate (MB/s) 0.741534
3  34538.3431   Rate (MB/s) 0.64732
4  32469.3375   Rate (MB/s) 0.608543
5  31041.1569   Rate (MB/s) 0.581776
6  30113.3826   Rate (MB/s) 0.564387
7  29562.5285   Rate (MB/s) 0.554063
8  29228.8090   Rate (MB/s) 0.547808
9  31474.3616   Rate (MB/s) 0.589895
10  31306.7647   Rate (MB/s) 0.586754
11  31147.4674   Rate (MB/s) 0.583768
12  31006.5008   Rate (MB/s) 0.581126
13  30859.4559   Rate (MB/s) 0.57837
14  30796.0587   Rate (MB/s) 0.577182
15  30604.4849   Rate (MB/s) 0.573591
16  30565.4340   Rate (MB/s) 0.572859
17  32421.9349   Rate (MB/s) 0.607654
18  34365.3424   Rate (MB/s) 0.644078
19  36289.4518   Rate (MB/s) 0.680139
20  38194.5300   Rate (MB/s) 0.715845
21  40160.4660   Rate (MB/s) 0.75269
22  42062.3931   Rate (MB/s) 0.788336
23  43890.2036   Rate (MB/s) 0.822593
24  45775.4680   Rate (MB/s) 0.857927
25  47708.8770   Rate (MB/s) 0.894163
26  49559.6810   Rate (MB/s) 0.928851
27  51457.5537   Rate (MB/s) 0.964421
28  53528.3420   Rate (MB/s) 1.00323


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Junchao Zhang
On Sat, Jan 22, 2022 at 10:04 AM Mark Adams  wrote:

> Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End()  right?
>
No, PetscLogGpuTime() does not know the flops of the caller.


>
> On Fri, Jan 21, 2022 at 9:47 PM Barry Smith  wrote:
>
>>
>>   Mark,
>>
>>   Fix the logging before you run more. It will help with seeing the true
>> disparity between the MatMult and the vector ops.
>>
>>
>> On Jan 21, 2022, at 9:37 PM, Mark Adams  wrote:
>>
>> Here is one with 2M / GPU. Getting better.
>>
>> On Fri, Jan 21, 2022 at 9:17 PM Barry Smith  wrote:
>>
>>>
>>>Matt is correct, vectors are way too small.
>>>
>>>BTW: Now would be a good time to run some of the Report I benchmarks
>>> on Crusher to get a feel for the kernel launch times and performance on
>>> VecOps.
>>>
>>>Also Report 2.
>>>
>>>   Barry
>>>
>>>
>>> On Jan 21, 2022, at 7:58 PM, Matthew Knepley  wrote:
>>>
>>> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams  wrote:
>>>
 I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian
 (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
 MI200?).
 This is with a 16M equation problem. GPU-aware MPI and non GPU-aware
 MPI are similar (mat-vec is a little faster w/o, the total is about the
 same, call it noise)

 I found that MatMult was about 3x faster using 8 cores/GPU, that is all
 64 cores on the node, than when using 1 core/GPU. With the same size
 problem of course.
 I was thinking MatMult should be faster with just one MPI process. Oh
 well, worry about that later.

 The bigger problem, and I have observed this to some extent with the
 Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are
 expensive or crazy expensive.
 You can see (attached) and the times here that the solve is dominated
 by not-mat-vec:


 
 EventCount  Time (sec) Flop
  --- Global ---  --- Stage   *Total   GPU *   - CpuToGpu -
   - GpuToCpu - GPU
Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen
  Reduct  %T %F %M %L %R  %T %F %M %L %R *Mflop/s Mflop/s* Count   Size
   Count   Size  %F

 ---
 17:15 main=
 /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult
  400" jac_out_00*5_8_gpuawaremp*
 MatMult  400 1.0 *1.2507e+00* 1.3 1.34e+10 1.1 3.7e+05
 1.6e+04 0.0e+00  1 55 62 54  0  27 91100100  0 *668874   0*  0
 0.00e+000 0.00e+00 100
 17:15 main=
 /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve
   2" jac_out_001*_5_8_gpuawaremp*
 KSPSolve   2 1.0 *4.4173e+00* 1.0 1.48e+10 1.1 3.7e+05
 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 *208923   1094405*
  0 0.00e+000 0.00e+00 100

 Notes about flop counters here,
 * that MatMult flops are not logged as GPU flops but something is
 logged nonetheless.
 * The GPU flop rate is 5x the total flop rate  in KSPSolve :\
 * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we
 are at < 1%.

>>>
>>> This looks complicated, so just a single remark:
>>>
>>> My understanding of the benchmarking of vector ops led by Hannah was
>>> that you needed to be much
>>> bigger than 16M to hit peak. I need to get the tech report, but on 8
>>> GPUs I would think you would be
>>> at 10% of peak or something right off the bat at these sizes. Barry, is
>>> that right?
>>>
>>>   Thanks,
>>>
>>>  Matt
>>>
>>>
 Anyway, not sure how to proceed but I thought I would share.
 Maybe ask the Kokkos guys if they have looked at Crusher.

 Mark

>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>> 
>>>
>>>
>>> 
>>
>>
>>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End()  right?

On Fri, Jan 21, 2022 at 9:47 PM Barry Smith  wrote:

>
>   Mark,
>
>   Fix the logging before you run more. It will help with seeing the true
> disparity between the MatMult and the vector ops.
>
>
> On Jan 21, 2022, at 9:37 PM, Mark Adams  wrote:
>
> Here is one with 2M / GPU. Getting better.
>
> On Fri, Jan 21, 2022 at 9:17 PM Barry Smith  wrote:
>
>>
>>Matt is correct, vectors are way too small.
>>
>>BTW: Now would be a good time to run some of the Report I benchmarks
>> on Crusher to get a feel for the kernel launch times and performance on
>> VecOps.
>>
>>Also Report 2.
>>
>>   Barry
>>
>>
>> On Jan 21, 2022, at 7:58 PM, Matthew Knepley  wrote:
>>
>> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams  wrote:
>>
>>> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian
>>> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
>>> MI200?).
>>> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI
>>> are similar (mat-vec is a little faster w/o, the total is about the same,
>>> call it noise)
>>>
>>> I found that MatMult was about 3x faster using 8 cores/GPU, that is all
>>> 64 cores on the node, than when using 1 core/GPU. With the same size
>>> problem of course.
>>> I was thinking MatMult should be faster with just one MPI process. Oh
>>> well, worry about that later.
>>>
>>> The bigger problem, and I have observed this to some extent with the
>>> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are
>>> expensive or crazy expensive.
>>> You can see (attached) and the times here that the solve is dominated by
>>> not-mat-vec:
>>>
>>>
>>> 
>>> EventCount  Time (sec) Flop
>>>  --- Global ---  --- Stage   *Total   GPU *   - CpuToGpu -
>>>   - GpuToCpu - GPU
>>>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen
>>>  Reduct  %T %F %M %L %R  %T %F %M %L %R *Mflop/s Mflop/s* Count   Size
>>>   Count   Size  %F
>>>
>>> ---
>>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
>>> grep "MatMult  400" jac_out_00*5_8_gpuawaremp*
>>> MatMult  400 1.0 *1.2507e+00* 1.3 1.34e+10 1.1 3.7e+05
>>> 1.6e+04 0.0e+00  1 55 62 54  0  27 91100100  0 *668874   0*  0
>>> 0.00e+000 0.00e+00 100
>>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
>>> grep "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
>>> KSPSolve   2 1.0 *4.4173e+00* 1.0 1.48e+10 1.1 3.7e+05
>>> 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 *208923   1094405*
>>>  0 0.00e+000 0.00e+00 100
>>>
>>> Notes about flop counters here,
>>> * that MatMult flops are not logged as GPU flops but something is logged
>>> nonetheless.
>>> * The GPU flop rate is 5x the total flop rate  in KSPSolve :\
>>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we
>>> are at < 1%.
>>>
>>
>> This looks complicated, so just a single remark:
>>
>> My understanding of the benchmarking of vector ops led by Hannah was that
>> you needed to be much
>> bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs
>> I would think you would be
>> at 10% of peak or something right off the bat at these sizes. Barry, is
>> that right?
>>
>>   Thanks,
>>
>>  Matt
>>
>>
>>> Anyway, not sure how to proceed but I thought I would share.
>>> Maybe ask the Kokkos guys if they have looked at Crusher.
>>>
>>> Mark
>>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> 
>>
>>
>> 
>
>
>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
Mark Adams  writes:

> On Fri, Jan 21, 2022 at 9:55 PM Barry Smith  wrote:
>
>>
>> Interesting, Is this with all native Kokkos kernels or do some kokkos
>> kernels use rocm?
>>
>
> Ah, good question. I often run with tpl=0 but I did not specify here on
> Crusher. In looking at the log files I see
> -I/gpfs/alpine/csc314/scratch/adams/petsc/arch-olcf-crusher/externalpackages/git.kokkos-kernels/src/impl/tpls
>
> Here is a run with tpls turned off. These tpl includes are gone.
>
> It looks pretty much the same. A little slower but that could be noise.

> 
> *** WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r 
> -fCourier9' to print this document***
> 

We gotta say 160 chars because that's what we use now.

> -- PETSc Performance Summary: 
> --
>
> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a 
> arch-olcf-crusher named crusher001 with 64 processors, by adams Fri Jan 21 
> 23:48:31 2022
> Using Petsc Development GIT revision: v3.16.3-665-g1012189b9a  GIT Date: 
> 2022-01-21 16:28:20 +
>
>  Max   Max/Min Avg   Total
> Time (sec):   7.919e+01 1.000   7.918e+01
> Objects:  2.088e+03 1.164   1.852e+03
> Flop: 2.448e+10 1.074   2.393e+10  1.532e+12
> Flop/sec: 3.091e+08 1.074   3.023e+08  1.935e+10
> MPI Messages: 1.651e+04 3.673   9.388e+03  6.009e+05
> MPI Message Lengths:  2.278e+08 2.093   1.788e+04  1.074e+10
> MPI Reductions:   1.988e+03 1.000
>
> Flop counting convention: 1 flop = 1 real number operation of type 
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N --> 
> 2N flop
> and VecAXPY() for complex vectors of length N --> 
> 8N flop
>
> Summary of Stages:   - Time --  - Flop --  --- Messages ---  
> -- Message Lengths --  -- Reductions --
> Avg %Total Avg %TotalCount   %Total   
>   Avg %TotalCount   %Total
>  0:  Main Stage: 7.4289e+01  93.8%  6.0889e+11  39.8%  2.265e+05  37.7%  
> 2.175e+04   45.8%  7.630e+02  38.4%
>  1: PCSetUp: 3.1604e-02   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
> 0.000e+000.0%  0.000e+00   0.0%
>  2:  KSP Solve only: 4.8576e+00   6.1%  9.2287e+11  60.2%  3.744e+05  62.3%  
> 1.554e+04   54.2%  1.206e+03  60.7%
>
> 
> See the 'Profiling' chapter of the users' manual for details on interpreting 
> output.
> Phase summary info:
>Count: number of times phase was executed
>Time and Flop: Max - maximum over all processors
>   Ratio - ratio of maximum to minimum over all processors
>Mess: number of messages sent
>AvgLen: average message length (bytes)
>Reduct: number of global reductions
>Global: entire computation
>Stage: stages of a computation. Set stages with PetscLogStagePush() and 
> PetscLogStagePop().
>   %T - percent time in this phase %F - percent flop in this phase
>   %M - percent messages in this phase %L - percent message lengths in 
> this phase
>   %R - percent reductions in this phase
>Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over 
> all processors)
>GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU 
> time over all processors)
>CpuToGpu Count: total number of CPU to GPU copies per processor
>CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per 
> processor)
>GpuToCpu Count: total number of GPU to CPU copies per processor
>GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per 
> processor)
>GPU %F: percent flops on GPU in this event
> 
> EventCount  Time (sec) Flop   
>--- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - 
> GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   
> Size  %F
> ---
>
> --- Event Stage 0: Main Stage
>
> PetscBarrier   5 1.0 2.0665e-01 1.1 0.00e+00 0.0 1.1e+04 8.0e+02 
> 1.8e+01  0  0  2  0  1   0  0  5  0  2 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
I should be able to add this profiling now.

On Fri, Jan 21, 2022 at 10:48 PM Junchao Zhang 
wrote:

>
>
>
> On Fri, Jan 21, 2022 at 8:08 PM Barry Smith  wrote:
>
>>
>>   Junchao, Mark,
>>
>>  Some of the logging information is non-sensible, MatMult says all
>> flops are done on the GPU (last column) but the GPU flop rate is zero.
>>
>>  It looks like  MatMult_SeqAIJKokkos() is missing
>> PetscLogGpuTimeBegin()/End() in fact all the operations in
>> aijkok.kokkos.cxx seem to be missing it. This might explain the crazy 0 GPU
>> flop rate. Can this be fixed ASAP?
>>
> I will add this profiling temporarily.  I may use Kokkos own profiling
> APIs later.
>
>
>>
>>  Regarding VecOps, sure looks the kernel launches are killing
>> performance.
>>
>>But in particular look at the VecTDot and VecNorm CPU flop
>> rates compared to the GPU, much lower, this tells me the MPI_Allreduce is
>> likely hurting performance in there also a great deal. It would be good to
>> see a single MPI rank job to compare to see performance without the MPI
>> overhead.
>>
>>
>>
>>
>>
>>
>>
>> On Jan 21, 2022, at 6:41 PM, Mark Adams  wrote:
>>
>> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian
>> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
>> MI200?).
>> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI
>> are similar (mat-vec is a little faster w/o, the total is about the same,
>> call it noise)
>>
>> I found that MatMult was about 3x faster using 8 cores/GPU, that is all
>> 64 cores on the node, then when using 1 core/GPU. With the same size
>> problem of course.
>> I was thinking MatMult should be faster with just one MPI process. Oh
>> well, worry about that later.
>>
>> The bigger problem, and I have observed this to some extent with the
>> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are
>> expensive or crazy expensive.
>> You can see (attached) and the times here that the solve is dominated by
>> not-mat-vec:
>>
>>
>> 
>> EventCount  Time (sec) Flop
>>--- Global ---  --- Stage   *Total   GPU *   - CpuToGpu -   -
>> GpuToCpu - GPU
>>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen
>>  Reduct  %T %F %M %L %R  %T %F %M %L %R *Mflop/s Mflop/s* Count   Size
>> Count   Size  %F
>>
>> ---
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
>> grep "MatMult  400" jac_out_00*5_8_gpuawaremp*
>> MatMult  400 1.0 *1.2507e+00* 1.3 1.34e+10 1.1 3.7e+05
>> 1.6e+04 0.0e+00  1 55 62 54  0  27 91100100  0 *668874   0*  0
>> 0.00e+000 0.00e+00 100
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
>> grep "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
>> KSPSolve   2 1.0 *4.4173e+00* 1.0 1.48e+10 1.1 3.7e+05
>> 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 *208923   1094405*  0
>> 0.00e+000 0.00e+00 100
>>
>> Notes about flop counters here,
>> * that MatMult flops are not logged as GPU flops but something is logged
>> nonetheless.
>> * The GPU flop rate is 5x the total flop rate  in KSPSolve :\
>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we
>> are at < 1%.
>>
>> Anway, not sure how to proceed but I thought I would share.
>> Maybe ask the Kokkos guys if the have looked at Crusher.
>>
>> Mark
>>
>>
>> 
>>
>>
>>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
On Fri, Jan 21, 2022 at 9:55 PM Barry Smith  wrote:

>
> Interesting, Is this with all native Kokkos kernels or do some kokkos
> kernels use rocm?
>

Ah, good question. I often run with tpl=0 but I did not specify here on
Crusher. In looking at the log files I see
-I/gpfs/alpine/csc314/scratch/adams/petsc/arch-olcf-crusher/externalpackages/git.kokkos-kernels/src/impl/tpls

Here is a run with tpls turned off. These tpl includes are gone.

It looks pretty much the same. A little slower but that could be noise.
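
(For what it's worth, and assuming the usual PETSc configure option names, this
can be pinned at configure time with --download-kokkos-kernels
--with-kokkos-kernels-tpl=0 instead of relying on what configure detects; that
is the "tpl=0" referred to above.)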
DM Object: box 64 MPI processes
  type: plex
box in 3 dimensions:
  Number of 0-cells per rank: 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937
  Number of 1-cells per rank: 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544
  Number of 2-cells per rank: 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376
  Number of 3-cells per rank: 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768
Labels:
  celltype: 4 strata with value/size (0 (35937), 1 (104544), 4 (101376), 7 
(32768))
  depth: 4 strata with value/size (0 (35937), 1 (104544), 2 (101376), 3 (32768))
  marker: 1 strata with value/size (1 (12474))
  Face Sets: 3 strata with value/size (1 (3969), 3 (3969), 6 (3969))
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 64 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 64 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 64 MPI processes
type: mpiaijkokkos
rows=16581375, cols=16581375
total: nonzeros=1045678375, allocated nonzeros=1045678375
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 64 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 64 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 64 MPI processes
type: mpiaijkokkos
rows=16581375, cols=16581375
total: nonzeros=1045678375, allocated nonzeros=1045678375
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 64 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 64 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 64 MPI processes
type: mpiaijkokkos
rows=16581375, cols=16581375
total: nonzeros=1045678375, allocated nonzeros=1045678375
total number of mallocs used during MatSetValues calls=0
  not using I-node (on process 0) routines

*** WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r 
-fCourier9' to print this document***

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Junchao Zhang
On Fri, Jan 21, 2022 at 8:08 PM Barry Smith  wrote:

>
>   Junchao, Mark,
>
>  Some of the logging information is non-sensible, MatMult says all
> flops are done on the GPU (last column) but the GPU flop rate is zero.
>
>  It looks like  MatMult_SeqAIJKokkos() is missing
> PetscLogGpuTimeBegin()/End() in fact all the operations in
> aijkok.kokkos.cxx seem to be missing it. This might explain the crazy 0 GPU
> flop rate. Can this be fixed ASAP?
>
I will add this profiling temporarily. I may use Kokkos' own profiling APIs
later.
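
(A minimal sketch of what using Kokkos' own profiling hooks could look like --
hypothetical placement, not the actual PETSc source; the named region is what a
Kokkos Tools library such as the simple kernel timer attributes time to:)

#include <Kokkos_Core.hpp>

// Hypothetical helper; the real change would go in aijkok.kokkos.cxx.
template <typename Body>
void ProfiledRegion(const char *name, Body &&body)
{
  Kokkos::Profiling::pushRegion(name); // region shows up in Kokkos Tools output
  body();
  Kokkos::Profiling::popRegion();
}

// usage sketch: ProfiledRegion("MatMult_SeqAIJKokkos", [&] { /* KokkosSparse::spmv(...) */ });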


>
>  Regarding VecOps, sure looks the kernel launches are killing
> performance.
>
>But in particular look at the VecTDot and VecNorm CPU flop
> rates compared to the GPU, much lower, this tells me the MPI_Allreduce is
> likely hurting performance in there also a great deal. It would be good to
> see a single MPI rank job to compare to see performance without the MPI
> overhead.
>
>
>
>
>
>
>
> On Jan 21, 2022, at 6:41 PM, Mark Adams  wrote:
>
> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian
> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
> MI200?).
> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI
> are similar (mat-vec is a little faster w/o, the total is about the same,
> call it noise)
>
> I found that MatMult was about 3x faster using 8 cores/GPU, that is all 64
> cores on the node, then when using 1 core/GPU. With the same size problem
> of course.
> I was thinking MatMult should be faster with just one MPI process. Oh
> well, worry about that later.
>
> The bigger problem, and I have observed this to some extent with the
> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are
> expensive or crazy expensive.
> You can see (attached) and the times here that the solve is dominated by
> not-mat-vec:
>
>
> 
> EventCount  Time (sec) Flop
>--- Global ---  --- Stage   *Total   GPU *   - CpuToGpu -   -
> GpuToCpu - GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen
>  Reduct  %T %F %M %L %R  %T %F %M %L %R *Mflop/s Mflop/s* Count   Size
> Count   Size  %F
>
> ---
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
> grep "MatMult  400" jac_out_00*5_8_gpuawaremp*
> MatMult  400 1.0 *1.2507e+00* 1.3 1.34e+10 1.1 3.7e+05
> 1.6e+04 0.0e+00  1 55 62 54  0  27 91100100  0 *668874   0*  0
> 0.00e+000 0.00e+00 100
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
> grep "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
> KSPSolve   2 1.0 *4.4173e+00* 1.0 1.48e+10 1.1 3.7e+05
> 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 *208923   1094405*  0
> 0.00e+000 0.00e+00 100
>
> Notes about flop counters here,
> * that MatMult flops are not logged as GPU flops but something is logged
> nonetheless.
> * The GPU flop rate is 5x the total flop rate  in KSPSolve :\
> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are
> at < 1%.
>
> Anway, not sure how to proceed but I thought I would share.
> Maybe ask the Kokkos guys if the have looked at Crusher.
>
> Mark
>
>
> 
>
>
>


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Barry Smith

Interesting. Is this with all native Kokkos kernels, or do some Kokkos kernels 
use ROCm? 

I ask because the VecNorm flop rate is 4 times that of VecDot, which I would not 
expect, and VecAXPY is less than 1/4 the performance of VecAYPX (which I would 
not expect either).


MatMult  400 1.0 1.0288e+00 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 
0.0e+00  0 54  0  0  0  43 91  0  0  0 98964   0  0 0.00e+000 
0.00e+00 100
MatView2 1.0 3.3745e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
KSPSolve   2 1.0 2.3989e+00 1.0 1.12e+11 1.0 0.0e+00 0.0e+00 
0.0e+00  1 60  0  0  0 100100  0  0  0 46887   220,001  0 0.00e+000 
0.00e+00 100
VecTDot  802 1.0 4.7745e-01 1.0 3.29e+09 1.0 0.0e+00 0.0e+00 
0.0e+00  0  2  0  0  0  20  3  0  0  0  6882   15,426  0 0.00e+000 
0.00e+00 100
VecNorm  402 1.0 1.1532e-01 1.0 1.65e+09 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   5  1  0  0  0 14281   62,757  0 0.00e+000 
0.00e+00 100
VecCopy4 1.0 2.1859e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
VecSet 4 1.0 2.1910e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
VecAXPY  800 1.0 5.5739e-01 1.0 3.28e+09 1.0 0.0e+00 0.0e+00 
0.0e+00  0  2  0  0  0  23  3  0  0  0  5880   14,666  0 0.00e+000 
0.00e+00 100
VecAYPX  398 1.0 1.0668e-01 1.0 1.63e+09 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   4  1  0  0  0 15284   71,218  0 0.00e+000 
0.00e+00 100
VecPointwiseMult 402 1.0 1.0930e-01 1.0 8.23e+08 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   5  1  0  0  0  7534   33,579  0 0.00e+000 
0.00e+00 100
PCApply  402 1.0 1.0940e-01 1.0 8.23e+08 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   5  1  0  0  0  7527   33,579  0 0.00e+000 
0.00e+00 100
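
(A quick way to read these lines, assuming this is one of the single-rank logs
since the ratios are 1.0 and no messages are sent: VecTDot does 3.29e9 flop in
4.77e-1 s, i.e. roughly 6.9 GF/s, which matches the 6882 Mflop/s column (flop
over wall time); 15,426 is flop over GPU time only, so less than half of the
~0.6 ms per call (0.477 s / 802 calls) is spent in the kernel itself, with the
rest presumably going to launch overhead and getting the scalar result back.)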



> On Jan 21, 2022, at 9:46 PM, Mark Adams  wrote:
> 
> 
>But in particular look at the VecTDot and VecNorm CPU flop rates 
> compared to the GPU, much lower, this tells me the MPI_Allreduce is likely 
> hurting performance in there also a great deal. It would be good to see a 
> single MPI rank job to compare to see performance without the MPI overhead.
> 
> Here are two single processor runs, with a whole GPU. It's not clear of 
> --ntasks-per-gpu=1 refers to the GPU socket (4 of them) or the GPUs (8).
>  
> 



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Barry Smith

  Mark,

  Fix the logging before you run more. It will help show the true 
disparity between MatMult and the vector ops.


> On Jan 21, 2022, at 9:37 PM, Mark Adams  wrote:
> 
> Here is one with 2M / GPU. Getting better.
> 
> On Fri, Jan 21, 2022 at 9:17 PM Barry Smith  > wrote:
> 
>Matt is correct, vectors are way too small.
> 
>BTW: Now would be a good time to run some of the Report I benchmarks on 
> Crusher to get a feel for the kernel launch times and performance on VecOps.
> 
>Also Report 2.
> 
>   Barry
> 
> 
>> On Jan 21, 2022, at 7:58 PM, Matthew Knepley > > wrote:
>> 
>> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams > > wrote:
>> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) 
>> on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it MI200?).
>> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI are 
>> similar (mat-vec is a little faster w/o, the total is about the same, call 
>> it noise)
>> 
>> I found that MatMult was about 3x faster using 8 cores/GPU, that is all 64 
>> cores on the node, then when using 1 core/GPU. With the same size problem of 
>> course.
>> I was thinking MatMult should be faster with just one MPI process. Oh well, 
>> worry about that later.
>> 
>> The bigger problem, and I have observed this to some extent with the Landau 
>> TS/SNES/GPU-solver on the V/A100s, is that the vector operations are 
>> expensive or crazy expensive.
>> You can see (attached) and the times here that the solve is dominated by 
>> not-mat-vec:
>> 
>> 
>> EventCount  Time (sec) Flop  
>> --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu 
>> - GPU
>>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count  
>>  Size  %F
>> ---
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ 
>> grep "MatMult  400" jac_out_00*5_8_gpuawaremp*
>> MatMult  400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 
>> 0.0e+00  1 55 62 54  0  27 91100100  0 668874   0  0 0.00e+000 
>> 0.00e+00 100
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ 
>> grep "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
>> KSPSolve   2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 
>> 1.2e+03  4 60 62 54 61 100100100100100 208923   1094405  0 0.00e+000 
>> 0.00e+00 100
>> 
>> Notes about flop counters here, 
>> * that MatMult flops are not logged as GPU flops but something is logged 
>> nonetheless.
>> * The GPU flop rate is 5x the total flop rate  in KSPSolve :\
>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are 
>> at < 1%.
>> 
>> This looks complicated, so just a single remark:
>> 
>> My understanding of the benchmarking of vector ops led by Hannah was that 
>> you needed to be much
>> bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs I 
>> would think you would be
>> at 10% of peak or something right off the bat at these sizes. Barry, is that 
>> right?
>> 
>>   Thanks,
>> 
>>  Matt
>>  
>> Anway, not sure how to proceed but I thought I would share.
>> Maybe ask the Kokkos guys if the have looked at Crusher.
>> 
>> Mark
>> -- 
>> What most experimenters take for granted before they begin their experiments 
>> is infinitely more interesting than any results to which their experiments 
>> lead.
>> -- Norbert Wiener
>> 
>> https://www.cse.buffalo.edu/~knepley/ 
> 
> 



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Mark Adams
>
>
>But in particular look at the VecTDot and VecNorm CPU flop
> rates compared to the GPU, much lower, this tells me the MPI_Allreduce is
> likely hurting performance in there also a great deal. It would be good to
> see a single MPI rank job to compare to see performance without the MPI
> overhead.
>

Here are two single processor runs, with a whole GPU. It's not clear whether
--ntasks-per-gpu=1 refers to the GPU sockets (4 of them) or the GPUs (8).
DM Object: box 1 MPI processes
  type: plex
box in 3 dimensions:
  Number of 0-cells per rank: 35937
  Number of 1-cells per rank: 104544
  Number of 2-cells per rank: 101376
  Number of 3-cells per rank: 32768
Labels:
  celltype: 4 strata with value/size (0 (35937), 1 (104544), 4 (101376), 7 
(32768))
  depth: 4 strata with value/size (0 (35937), 1 (104544), 2 (101376), 3 (32768))
  marker: 1 strata with value/size (1 (24480))
  Face Sets: 6 strata with value/size (6 (3600), 5 (3600), 3 (3600), 4 (3600), 
1 (3600), 2 (3600))
  Linear solve converged due to CONVERGED_RTOL iterations 122
KSP Object: 1 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 1 MPI processes
type: seqaijkokkos
rows=250047, cols=250047
total: nonzeros=15069223, allocated nonzeros=15069223
total number of mallocs used during MatSetValues calls=0
  not using I-node routines
  Linear solve converged due to CONVERGED_RTOL iterations 122
KSP Object: 1 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 1 MPI processes
type: seqaijkokkos
rows=250047, cols=250047
total: nonzeros=15069223, allocated nonzeros=15069223
total number of mallocs used during MatSetValues calls=0
  not using I-node routines
  Linear solve converged due to CONVERGED_RTOL iterations 122
KSP Object: 1 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 1 MPI processes
type: seqaijkokkos
rows=250047, cols=250047
total: nonzeros=15069223, allocated nonzeros=15069223
total number of mallocs used during MatSetValues calls=0
  not using I-node routines

*** WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r 
-fCourier9' to print this document***


-- PETSc Performance Summary: 
--

/gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a 
arch-olcf-crusher named crusher003 with 1 processor, by adams Fri Jan 21 
21:30:02 2022
Using Petsc Development GIT revision: v3.16.3-665-g1012189b9a  GIT Date: 
2022-01-21 16:28:20 +

 Max   Max/Min Avg   Total
Time (sec):   5.916e+01 1.000   5.916e+01
Objects:  1.637e+03 1.000   1.637e+03
Flop: 1.454e+10 1.000   1.454e+10  1.454e+10
Flop/sec: 2.459e+08 1.000   2.459e+08  2.459e+08
MPI Messages: 0.000e+00 0.000   0.000e+00  0.000e+00
MPI Message Lengths:  1.800e+01 1.000   0.000e+00  1.800e+01
MPI Reductions:   9.000e+00 1.000

Flop counting convention: 1 flop = 1 real number operation of type 
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N 
flop
and VecAXPY() for complex vectors of length N --> 
8N flop

Summary of Stages:   - Time --  - Flop --  --- Messages ---  -- 
Message Lengths --  -- Reductions --
Avg %Total Avg %TotalCount   %Total 
Avg %TotalCount   %Total
 0:  Main Stage: 5.8503e+01  98.9%  6.3978e+09  44.0%  0.000e+00   0.0%  
0.000e+00  100.0%  9.000e+00 100.0%
 1: PCSetUp: 2.0318e-02   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 2:  KSP Solve only: 6.3347e-01   1.1%  8.1469e+09  56.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%


Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Mark Adams
Here is one with 2M / GPU. Getting better.

On Fri, Jan 21, 2022 at 9:17 PM Barry Smith  wrote:

>
>Matt is correct, vectors are way too small.
>
>BTW: Now would be a good time to run some of the Report I benchmarks on
> Crusher to get a feel for the kernel launch times and performance on VecOps.
>
>Also Report 2.
>
>   Barry
>
>
> On Jan 21, 2022, at 7:58 PM, Matthew Knepley  wrote:
>
> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams  wrote:
>
>> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian
>> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
>> MI200?).
>> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI
>> are similar (mat-vec is a little faster w/o, the total is about the same,
>> call it noise)
>>
>> I found that MatMult was about 3x faster using 8 cores/GPU, that is all
>> 64 cores on the node, then when using 1 core/GPU. With the same size
>> problem of course.
>> I was thinking MatMult should be faster with just one MPI process. Oh
>> well, worry about that later.
>>
>> The bigger problem, and I have observed this to some extent with the
>> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are
>> expensive or crazy expensive.
>> You can see (attached) and the times here that the solve is dominated by
>> not-mat-vec:
>>
>>
>> 
>> EventCount  Time (sec) Flop
>>--- Global ---  --- Stage   *Total   GPU *   - CpuToGpu -   -
>> GpuToCpu - GPU
>>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen
>>  Reduct  %T %F %M %L %R  %T %F %M %L %R *Mflop/s Mflop/s* Count   Size
>> Count   Size  %F
>>
>> ---
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
>> grep "MatMult  400" jac_out_00*5_8_gpuawaremp*
>> MatMult  400 1.0 *1.2507e+00* 1.3 1.34e+10 1.1 3.7e+05
>> 1.6e+04 0.0e+00  1 55 62 54  0  27 91100100  0 *668874   0*  0
>> 0.00e+000 0.00e+00 100
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
>> grep "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
>> KSPSolve   2 1.0 *4.4173e+00* 1.0 1.48e+10 1.1 3.7e+05
>> 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 *208923   1094405*  0
>> 0.00e+000 0.00e+00 100
>>
>> Notes about flop counters here,
>> * that MatMult flops are not logged as GPU flops but something is logged
>> nonetheless.
>> * The GPU flop rate is 5x the total flop rate  in KSPSolve :\
>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we
>> are at < 1%.
>>
>
> This looks complicated, so just a single remark:
>
> My understanding of the benchmarking of vector ops led by Hannah was that
> you needed to be much
> bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs
> I would think you would be
> at 10% of peak or something right off the bat at these sizes. Barry, is
> that right?
>
>   Thanks,
>
>  Matt
>
>
>> Anway, not sure how to proceed but I thought I would share.
>> Maybe ask the Kokkos guys if the have looked at Crusher.
>>
>> Mark
>>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> 
>
>
>
DM Object: box 64 MPI processes
  type: plex
box in 3 dimensions:
  Number of 0-cells per rank: 274625 274625 274625 274625 274625 274625 274625 
274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 
274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 
274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 
274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 
274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 274625 
274625 274625
  Number of 1-cells per rank: 811200 811200 811200 811200 811200 811200 811200 
811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 
811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 
811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 
811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 
811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 811200 
811200 811200
  Number of 2-cells per rank: 798720 798720 798720 798720 798720 798720 798720 
798720 798720 798720 798720 798720 798720 798720 798720 798720 798720 798720 
798720 798720 798720 798720 798720 798720 798720 798720 798720 798720 798720 
798720 798720 798720 798720 798720 798720 798720 798720 798720 

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Barry Smith

   Matt is correct, vectors are way too small.

   BTW: Now would be a good time to run some of the Report I benchmarks on 
Crusher to get a feel for the kernel launch times and performance on VecOps.

   Also Report 2.

  Barry
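
(Not the Report I benchmark itself, which is not reproduced here, but a minimal
Kokkos-only sketch of the kind of measurement Barry means -- launch overhead and
AXPY-like streaming rate on one GCD; the 2M length mirrors the per-GPU sizes in
these runs:)

#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char **argv)
{
  Kokkos::initialize(argc, argv);
  {
    const int    nrep = 1000;
    const size_t N    = 2 * 1000 * 1000; // ~2M doubles per GPU, as above
    Kokkos::View<double *> x("x", N), y("y", N);
    Kokkos::deep_copy(x, 1.0);
    Kokkos::deep_copy(y, 2.0);
    Kokkos::fence();

    Kokkos::Timer t;
    for (int i = 0; i < nrep; i++) // back-to-back empty kernels: dispatch cost per launch
      Kokkos::parallel_for("empty", 1, KOKKOS_LAMBDA(size_t) {});
    Kokkos::fence();
    printf("launch overhead ~ %.1f us/kernel\n", 1e6 * t.seconds() / nrep);

    t.reset();
    for (int i = 0; i < nrep; i++) // y = 2*x + y, bandwidth bound: ~3*8*N bytes per call
      Kokkos::parallel_for("axpy", N, KOKKOS_LAMBDA(const size_t j) { y(j) += 2.0 * x(j); });
    Kokkos::fence();
    printf("axpy ~ %.1f us/call, ~%.0f GB/s\n", 1e6 * t.seconds() / nrep,
           nrep * 3.0 * 8.0 * N / t.seconds() / 1e9);
  }
  Kokkos::finalize();
  return 0;
}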


> On Jan 21, 2022, at 7:58 PM, Matthew Knepley  wrote:
> 
> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams  > wrote:
> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) 
> on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it MI200?).
> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI are 
> similar (mat-vec is a little faster w/o, the total is about the same, call it 
> noise)
> 
> I found that MatMult was about 3x faster using 8 cores/GPU, that is all 64 
> cores on the node, then when using 1 core/GPU. With the same size problem of 
> course.
> I was thinking MatMult should be faster with just one MPI process. Oh well, 
> worry about that later.
> 
> The bigger problem, and I have observed this to some extent with the Landau 
> TS/SNES/GPU-solver on the V/A100s, is that the vector operations are 
> expensive or crazy expensive.
> You can see (attached) and the times here that the solve is dominated by 
> not-mat-vec:
> 
> 
> EventCount  Time (sec) Flop   
>--- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - 
> GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   
> Size  %F
> ---
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep 
> "MatMult  400" jac_out_00*5_8_gpuawaremp*
> MatMult  400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 
> 0.0e+00  1 55 62 54  0  27 91100100  0 668874   0  0 0.00e+000 
> 0.00e+00 100
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep 
> "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
> KSPSolve   2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 
> 1.2e+03  4 60 62 54 61 100100100100100 208923   1094405  0 0.00e+000 
> 0.00e+00 100
> 
> Notes about flop counters here, 
> * that MatMult flops are not logged as GPU flops but something is logged 
> nonetheless.
> * The GPU flop rate is 5x the total flop rate  in KSPSolve :\
> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are at 
> < 1%.
> 
> This looks complicated, so just a single remark:
> 
> My understanding of the benchmarking of vector ops led by Hannah was that you 
> needed to be much
> bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs I 
> would think you would be
> at 10% of peak or something right off the bat at these sizes. Barry, is that 
> right?
> 
>   Thanks,
> 
>  Matt
>  
> Anway, not sure how to proceed but I thought I would share.
> Maybe ask the Kokkos guys if the have looked at Crusher.
> 
> Mark
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/ 



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Barry Smith

  Junchao, Mark,

 Some of the logging information is nonsensical: MatMult says all flops 
are done on the GPU (last column), but the GPU flop rate is zero. 

 It looks like MatMult_SeqAIJKokkos() is missing 
PetscLogGpuTimeBegin()/End(); in fact, all the operations in aijkok.kokkos.cxx 
seem to be missing it. This might explain the crazy 0 GPU flop rate. Can this 
be fixed ASAP?
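
(The pattern being asked for, as a sketch only -- not the actual aijkok.kokkos.cxx
source; the function name, template parameters, and flop count are illustrative,
while PetscLogGpuTimeBegin/End and PetscLogGpuFlops are the real logging calls:)

#include <petscmat.h>              /* pulls in the PETSc logging interface */
#include <KokkosSparse_spmv.hpp>

/* Sketch: what a Kokkos MatMult needs so -log_view reports a GPU time and
   therefore a nonzero GPU flop rate. "AMat" stands in for whatever device
   CSR type the backend really uses (e.g. a KokkosSparse::CrsMatrix). */
template <typename AMat, typename XVec, typename YVec>
static PetscErrorCode MatMultKokkos_Sketch(const AMat &A, const XVec &x, YVec &y)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);          /* start the GPU timer        */
  KokkosSparse::spmv("N", 1.0, A, x, 0.0, y);           /* y = A*x on the device      */
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);            /* stop it: the missing piece */
  ierr = PetscLogGpuFlops(2.0*A.nnz());CHKERRQ(ierr);   /* ~2*nnz flops, as GPU flops */
  PetscFunctionReturn(0);
}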

 Regarding VecOps, it sure looks like the kernel launches are killing performance. 

   But look in particular at the VecTDot and VecNorm CPU flop rates 
compared to the GPU rates: they are much lower, which tells me the MPI_Allreduce is 
likely also hurting performance there a great deal. It would be good to see a 
single MPI rank job to compare against, to see performance without the MPI overhead.







> On Jan 21, 2022, at 6:41 PM, Mark Adams  wrote:
> 
> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) 
> on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it MI200?).
> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI are 
> similar (mat-vec is a little faster w/o, the total is about the same, call it 
> noise)
> 
> I found that MatMult was about 3x faster using 8 cores/GPU, that is all 64 
> cores on the node, then when using 1 core/GPU. With the same size problem of 
> course.
> I was thinking MatMult should be faster with just one MPI process. Oh well, 
> worry about that later.
> 
> The bigger problem, and I have observed this to some extent with the Landau 
> TS/SNES/GPU-solver on the V/A100s, is that the vector operations are 
> expensive or crazy expensive.
> You can see (attached) and the times here that the solve is dominated by 
> not-mat-vec:
> 
> 
> EventCount  Time (sec) Flop   
>--- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - 
> GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   
> Size  %F
> ---
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep 
> "MatMult  400" jac_out_00*5_8_gpuawaremp*
> MatMult  400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 
> 0.0e+00  1 55 62 54  0  27 91100100  0 668874   0  0 0.00e+000 
> 0.00e+00 100
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep 
> "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
> KSPSolve   2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 
> 1.2e+03  4 60 62 54 61 100100100100100 208923   1094405  0 0.00e+000 
> 0.00e+00 100
> 
> Notes about flop counters here, 
> * that MatMult flops are not logged as GPU flops but something is logged 
> nonetheless.
> * The GPU flop rate is 5x the total flop rate  in KSPSolve :\
> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are at 
> < 1%.
> 
> Anway, not sure how to proceed but I thought I would share.
> Maybe ask the Kokkos guys if the have looked at Crusher.
> 
> Mark
> 
> 
> 



Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Matthew Knepley
On Fri, Jan 21, 2022 at 6:41 PM Mark Adams  wrote:

> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian
> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
> MI200?).
> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI
> are similar (mat-vec is a little faster w/o, the total is about the same,
> call it noise)
>
> I found that MatMult was about 3x faster using 8 cores/GPU, that is all 64
> cores on the node, then when using 1 core/GPU. With the same size problem
> of course.
> I was thinking MatMult should be faster with just one MPI process. Oh
> well, worry about that later.
>
> The bigger problem, and I have observed this to some extent with the
> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are
> expensive or crazy expensive.
> You can see (attached) and the times here that the solve is dominated by
> not-mat-vec:
>
>
> 
> EventCount  Time (sec) Flop
>--- Global ---  --- Stage   *Total   GPU *   - CpuToGpu -   -
> GpuToCpu - GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen
>  Reduct  %T %F %M %L %R  %T %F %M %L %R *Mflop/s Mflop/s* Count   Size
> Count   Size  %F
>
> ---
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
> grep "MatMult  400" jac_out_00*5_8_gpuawaremp*
> MatMult  400 1.0 *1.2507e+00* 1.3 1.34e+10 1.1 3.7e+05
> 1.6e+04 0.0e+00  1 55 62 54  0  27 91100100  0 *668874   0*  0
> 0.00e+000 0.00e+00 100
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
> grep "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
> KSPSolve   2 1.0 *4.4173e+00* 1.0 1.48e+10 1.1 3.7e+05
> 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 *208923   1094405*  0
> 0.00e+000 0.00e+00 100
>
> Notes about flop counters here,
> * that MatMult flops are not logged as GPU flops but something is logged
> nonetheless.
> * The GPU flop rate is 5x the total flop rate  in KSPSolve :\
> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are
> at < 1%.
>

This looks complicated, so just a single remark:

My understanding of the benchmarking of vector ops led by Hannah was that
you needed to be much
bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs I
would think you would be
at 10% of peak or something right off the bat at these sizes. Barry, is
that right?

  Thanks,

 Matt


> Anway, not sure how to proceed but I thought I would share.
> Maybe ask the Kokkos guys if the have looked at Crusher.
>
> Mark
>
-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 
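
(A rough check of Matt's point, with assumed numbers: 16.6M dofs over 8 GCDs is
about 2M doubles per GCD, so a VecAXPY moves roughly 3 x 8 x 2e6 ~ 50 MB per GCD
per call. At an assumed ~1.3 TB/s of achievable HBM bandwidth per GCD that is
only ~40 us of useful work, i.e. just a few kernel-launch latencies, so these
sizes sit well below the bandwidth asymptote -- and since AXPY does 2 flops per
24 bytes, even at full bandwidth it would reach only ~0.1 TF/s per GCD, a tiny
fraction of the FP64 peak.)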


[petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Mark Adams
I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian
(ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
MI200?).
This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI
are similar (mat-vec is a little faster w/o, the total is about the same,
call it noise)

I found that MatMult was about 3x faster using 8 cores/GPU, that is, all 64
cores on the node, than when using 1 core/GPU, with the same size problem
of course.
I was thinking MatMult should be faster with just one MPI process. Oh well,
worry about that later.

The bigger problem, and I have observed this to some extent with the Landau
TS/SNES/GPU-solver on the V/A100s, is that the vector operations are
expensive or crazy expensive.
You can see from the attached log, and from the times here, that the solve is
dominated by the non-mat-vec operations:


EventCount  Time (sec) Flop
 --- Global ---  --- Stage   *Total   GPU *   - CpuToGpu -   -
GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen
 Reduct  %T %F %M %L %R  %T %F %M %L %R *Mflop/s Mflop/s* Count   Size
Count   Size  %F
---
17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
grep "MatMult  400" jac_out_00*5_8_gpuawaremp*
MatMult  400 1.0 *1.2507e+00* 1.3 1.34e+10 1.1 3.7e+05 1.6e+04
0.0e+00  1 55 62 54  0  27 91100100  0 *668874   0*  0 0.00e+00
 0 0.00e+00 100
17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$
grep "KSPSolve   2" jac_out_001*_5_8_gpuawaremp*
KSPSolve   2 1.0 *4.4173e+00* 1.0 1.48e+10 1.1 3.7e+05 1.6e+04
1.2e+03  4 60 62 54 61 100100100100100 *208923   1094405*  0 0.00e+00
 0 0.00e+00 100

Notes about the flop counters here:
* MatMult flops are not logged as GPU flops, but something is logged
nonetheless.
* The GPU flop rate is 5x the total flop rate in KSPSolve :\
* I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are
at < 1%.

Anyway, not sure how to proceed, but I thought I would share.
Maybe ask the Kokkos guys if they have looked at Crusher.

Mark
DM Object: box 64 MPI processes
  type: plex
box in 3 dimensions:
  Number of 0-cells per rank: 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 35937 
35937 35937 35937 35937
  Number of 1-cells per rank: 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 104544 
104544 104544
  Number of 2-cells per rank: 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 101376 
101376 101376
  Number of 3-cells per rank: 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 32768 
32768 32768 32768 32768
Labels:
  celltype: 4 strata with value/size (0 (35937), 1 (104544), 4 (101376), 7 
(32768))
  depth: 4 strata with value/size (0 (35937), 1 (104544), 2 (101376), 3 (32768))
  marker: 1 strata with value/size (1 (12474))
  Face Sets: 3 strata with value/size (1 (3969), 3 (3969), 6 (3969))
  Linear solve did not converge due to DIVERGED_ITS iterations 200
KSP Object: 64 MPI processes
  type: cg
  maximum iterations=200, initial guess is zero
  tolerances:  relative=1e-12, absolute=1e-50, divergence=1.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 64 MPI processes
  type: jacobi
type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 64 MPI processes
