Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-09-01 Thread Smith, Barry F. via petsc-dev


git branch  --contains barry/2019-09-01/robustify-version-check
  balay/jed-gitlab-ci
  
  master


  Make a new branch from your current branch, append something like 
-feature-sf-on-gpu to the end of the name, and merge in jczhang/feature-sf-on-gpu.

   Configure and test with that.
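   For example, something like this (a sketch only -- the PETSC_ARCH and
configure options below are placeholders, use your usual ones):

      git checkout -b mark/fix-cuda-with-gamg-pintocpu-feature-sf-on-gpu
      git merge origin/jczhang/feature-sf-on-gpu
      ./configure PETSC_ARCH=arch-summit-opt-cuda --with-cuda=1   # plus your usual options
      make PETSC_ARCH=arch-summit-opt-cuda all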

   Barry


> On Sep 1, 2019, at 9:50 AM, Mark Adams  wrote:
> 
> Junchao and Barry, 
> 
> I am using mark/fix-cuda-with-gamg-pintocpu, which is built on barry's 
> robustify branch. Is this in master yet? If so, I'd like to get my branch 
> merged to master, then merge Junchao's branch. Then use it.
> 
> I think we were waiting for some refactoring from Karl to proceed.
> 
> Anyway, I'm not sure how to proceed.
> 
> Thanks,
> Mark
> 
> 
> On Sun, Sep 1, 2019 at 8:45 AM Zhang, Junchao  wrote:
> 
> 
> 
> On Sat, Aug 31, 2019 at 8:04 PM Mark Adams  wrote:
> 
> 
> On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F.  wrote:
> 
>   Any explanation for why the scaling is much better for CPUs than GPUs? 
> Is it the "extra" time needed for communication from the GPUs? 
> 
> The GPU work is well load balanced so it weak scales perfectly. When you put 
> that work in the CPU you get more perfectly scalable work added so it looks 
> better. For instance, the 98K dof/proc data goes up by about 1/2 sec. from 
> the 1 node to 512 node case for both GPU and CPU, because this non-scaling is 
> from communication that is the same for both cases
>  
> 
>   Perhaps you could try the GPU version with Junchao's new MPI-aware CUDA 
> branch (in the gitlab merge requests)  that can speed up the communication 
> from GPUs?
> 
> Sure, do I just check out jczhang/feature-sf-on-gpu and run as usual?
> 
> Use jsrun --smpiargs="-gpu" to enable IBM MPI's CUDA-aware support, then add 
> the -use_gpu_aware_mpi option to let PETSc use that feature.
>  
>  
> 
>Barry
> 
> 
> > On Aug 30, 2019, at 11:56 AM, Mark Adams  wrote:
> > 
> > Here is some more weak scaling data with a fixed number of iterations (I 
> > have given a test with the numerical problems to ORNL and they said they 
> > would give it to Nvidia).
> > 
> > I implemented an option to "spread" the reduced coarse grids across the 
> > whole machine as opposed to a "compact" layout where active processes are 
> > laid out in simple lexicographical order. This spread approach looks a 
> > little better.
> > 
> > Mark
> > 
> > On Wed, Aug 14, 2019 at 10:46 PM Smith, Barry F.  wrote:
> > 
> >   Ahh, PGI compiler, that explains it :-)
> > 
> >   Ok, thanks. Don't worry about the runs right now. We'll figure out the 
> > fix. The code is just
> > 
> >   *a = (PetscReal)strtod(name,endptr);
> > 
> >   could be a compiler bug.
> > 
> > 
> > 
> > 
> > > On Aug 14, 2019, at 9:23 PM, Mark Adams  wrote:
> > > 
> > > I am getting this error with single:
> > > 
> > > 22:21  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1 
> > > ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type 
> > > aijcusparse -fp_trap 
> > > [0] 81 global equations, 27 vertices
> > > [0]PETSC ERROR: *** unknown floating point error occurred ***
> > > [0]PETSC ERROR: The specific exception can be determined by running in a 
> > > debugger.  When the
> > > [0]PETSC ERROR: debugger traps the signal, the exception can be found 
> > > with fetestexcept(0x3e00)
> > > [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
> > > [0]PETSC ERROR: FE_INVALID=0x2000 FE_DIVBYZERO=0x400 
> > > FE_OVERFLOW=0x1000 FE_UNDERFLOW=0x800 FE_INEXACT=0x200
> > > [0]PETSC ERROR: Try option -start_in_debugger
> > > [0]PETSC ERROR: likely location of problem given in stack below
> > > [0]PETSC ERROR: -  Stack Frames 
> > > 
> > > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not 
> > > available,
> > > [0]PETSC ERROR:   INSTEAD the line number of the start of the function
> > > [0]PETSC ERROR:   is given.
> > > [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355 
> > > /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
> > > [0]PETSC ERROR: [0] PetscStrtod line 1964 
> > > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > > [0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021 
> > > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > > [0]PETSC ERROR: [0] PetscOptionsGetReal line 2321 
> > > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > > [0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015 
> > > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
> > > [0]PETSC ERROR: [0] KSPSetFromOptions line 329 
> > > /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
> > > [0]PETSC ERROR: [0] SNESSetFromOptions line 869 
> > > /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
> > > [0]PETSC ERROR: - Error Message 
> > > --
> > > [0]PETSC ERROR: Floating point 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-09-01 Thread Mark Adams via petsc-dev
Junchao and Barry,

I am using mark/fix-cuda-with-gamg-pintocpu, which is built on barry's
robustify branch. Is this in master yet? If so, I'd like to get my branch
merged to master, then merge Junchao's branch. Then use it.

I think we were waiting for some refactoring from Karl to proceed.

Anyway, I'm not sure how to proceed.

Thanks,
Mark


On Sun, Sep 1, 2019 at 8:45 AM Zhang, Junchao  wrote:

>
>
>
> On Sat, Aug 31, 2019 at 8:04 PM Mark Adams  wrote:
>
>>
>>
>> On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F. 
>> wrote:
>>
>>>
>>>   Any explanation for why the scaling is much better for CPUs than
>>> GPUs? Is it the "extra" time needed for communication from the GPUs?
>>>
>>
>> The GPU work is well load balanced so it weak scales perfectly. When you
>> put that work in the CPU you get more perfectly scalable work added so it
>> looks better. For instance, the 98K dof/proc data goes up by about 1/2 sec.
>> from the 1 node to 512 node case for both GPU and CPU, because this
>> non-scaling is from communication that is the same for both cases
>>
>>
>>>
>>>   Perhaps you could try the GPU version with Junchao's new MPI-aware
>>> CUDA branch (in the gitlab merge requests)  that can speed up the
>>> communication from GPUs?
>>>
>>
>> Sure, do I just check out jczhang/feature-sf-on-gpu and run as usual?
>>
>
> Use jsrun --smpiargs="-gpu" to enable IBM MPI's CUDA-aware support, then
> add the -use_gpu_aware_mpi option to let PETSc use that feature.
>
>
>>
>>
>>>
>>>Barry
>>>
>>>
>>> > On Aug 30, 2019, at 11:56 AM, Mark Adams  wrote:
>>> >
>>> > Here is some more weak scaling data with a fixed number of iterations
>>> (I have given a test with the numerical problems to ORNL and they said they
>>> would give it to Nvidia).
>>> >
>>> > I implemented an option to "spread" the reduced coarse grids across
>>> the whole machine as opposed to a "compact" layout where active processes
>>> are laid out in simple lexicographical order. This spread approach looks a
>>> little better.
>>> >
>>> > Mark
>>> >
>>> > On Wed, Aug 14, 2019 at 10:46 PM Smith, Barry F. 
>>> wrote:
>>> >
>>> >   Ahh, PGI compiler, that explains it :-)
>>> >
>>> >   Ok, thanks. Don't worry about the runs right now. We'll figure out
>>> the fix. The code is just
>>> >
>>> >   *a = (PetscReal)strtod(name,endptr);
>>> >
>>> >   could be a compiler bug.
>>> >
>>> >
>>> >
>>> >
>>> > > On Aug 14, 2019, at 9:23 PM, Mark Adams  wrote:
>>> > >
>>> > > I am getting this error with single:
>>> > >
>>> > > 22:21  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1
>>> ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type
>>> aijcusparse -fp_trap
>>> > > [0] 81 global equations, 27 vertices
>>> > > [0]PETSC ERROR: *** unknown floating point error occurred ***
>>> > > [0]PETSC ERROR: The specific exception can be determined by running
>>> in a debugger.  When the
>>> > > [0]PETSC ERROR: debugger traps the signal, the exception can be
>>> found with fetestexcept(0x3e00)
>>> > > [0]PETSC ERROR: where the result is a bitwise OR of the following
>>> flags:
>>> > > [0]PETSC ERROR: FE_INVALID=0x2000 FE_DIVBYZERO=0x400
>>> FE_OVERFLOW=0x1000 FE_UNDERFLOW=0x800 FE_INEXACT=0x200
>>> > > [0]PETSC ERROR: Try option -start_in_debugger
>>> > > [0]PETSC ERROR: likely location of problem given in stack below
>>> > > [0]PETSC ERROR: -  Stack Frames
>>> 
>>> > > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not
>>> available,
>>> > > [0]PETSC ERROR:   INSTEAD the line number of the start of the
>>> function
>>> > > [0]PETSC ERROR:   is given.
>>> > > [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
>>> > > [0]PETSC ERROR: [0] PetscStrtod line 1964
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
>>> > > [0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
>>> > > [0]PETSC ERROR: [0] PetscOptionsGetReal line 2321
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
>>> > > [0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
>>> > > [0]PETSC ERROR: [0] KSPSetFromOptions line 329
>>> /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
>>> > > [0]PETSC ERROR: [0] SNESSetFromOptions line 869
>>> /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
>>> > > [0]PETSC ERROR: - Error Message
>>> --
>>> > > [0]PETSC ERROR: Floating point exception
>>> > > [0]PETSC ERROR: trapped floating point error
>>> > > [0]PETSC ERROR: See
>>> https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
>>> shooting.
>>> > > [0]PETSC ERROR: Petsc Development GIT revision:
>>> v3.11.3-1685-gd3eb2e1  GIT Date: 2019-08-13 06:33:29 -0400
>>> > > 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-09-01 Thread Zhang, Junchao via petsc-dev



On Sat, Aug 31, 2019 at 8:04 PM Mark Adams <mfad...@lbl.gov> wrote:


On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:

  Any explanation for why the scaling is much better for CPUs than GPUs? Is 
it the "extra" time needed for communication from the GPUs?

The GPU work is well load balanced so it weak scales perfectly. When you put 
that work in the CPU you get more perfectly scalable work added so it looks 
better. For instance, the 98K dof/proc data goes up by about 1/2 sec. from the 
1 node to 512 node case for both GPU and CPU, because this non-scaling is from 
communication that is the same for both cases


  Perhaps you could try the GPU version with Junchao's new MPI-aware CUDA 
branch (in the gitlab merge requests)  that can speed up the communication from 
GPUs?

Sure, do I just check out jczhang/feature-sf-on-gpu and run as usual?

Use jsrun --smpiargs="-gpu" to enable IBM MPI's CUDA-aware support, then add 
the -use_gpu_aware_mpi option to let PETSc use that feature.
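For example (the resource counts here are only illustrative, based on the 
6 GPUs and 7 cores per GPU mentioned elsewhere in this thread; adjust them to 
your job):

    jsrun --smpiargs="-gpu" -n 6 -a 1 -c 7 -g 1 ./ex56_single -cells 2,2,2 \
      -ex56_dm_vec_type cuda -ex56_dm_mat_type aijcusparse -use_gpu_aware_mpi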



   Barry


> On Aug 30, 2019, at 11:56 AM, Mark Adams <mfad...@lbl.gov> wrote:
>
> Here is some more weak scaling data with a fixed number of iterations (I have 
> given a test with the numerical problems to ORNL and they said they would 
> give it to Nvidia).
>
> I implemented an option to "spread" the reduced coarse grids across the whole 
> machine as opposed to a "compact" layout where active processes are laid out 
> in simple lexicographical order. This spread approach looks a little better.
>
> Mark
>
> On Wed, Aug 14, 2019 at 10:46 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>
>   Ahh, PGI compiler, that explains it :-)
>
>   Ok, thanks. Don't worry about the runs right now. We'll figure out the fix. 
> The code is just
>
>   *a = (PetscReal)strtod(name,endptr);
>
>   could be a compiler bug.
>
>
>
>
> > On Aug 14, 2019, at 9:23 PM, Mark Adams <mfad...@lbl.gov> wrote:
> >
> > I am getting this error with single:
> >
> > 22:21  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1 
> > ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type 
> > aijcusparse -fp_trap
> > [0] 81 global equations, 27 vertices
> > [0]PETSC ERROR: *** unknown floating point error occurred ***
> > [0]PETSC ERROR: The specific exception can be determined by running in a 
> > debugger.  When the
> > [0]PETSC ERROR: debugger traps the signal, the exception can be found with 
> > fetestexcept(0x3e00)
> > [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
> > [0]PETSC ERROR: FE_INVALID=0x2000 FE_DIVBYZERO=0x400 
> > FE_OVERFLOW=0x1000 FE_UNDERFLOW=0x800 FE_INEXACT=0x200
> > [0]PETSC ERROR: Try option -start_in_debugger
> > [0]PETSC ERROR: likely location of problem given in stack below
> > [0]PETSC ERROR: -  Stack Frames 
> > 
> > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> > [0]PETSC ERROR:   INSTEAD the line number of the start of the function
> > [0]PETSC ERROR:   is given.
> > [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
> > [0]PETSC ERROR: [0] PetscStrtod line 1964 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > [0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > [0]PETSC ERROR: [0] PetscOptionsGetReal line 2321 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > [0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
> > [0]PETSC ERROR: [0] KSPSetFromOptions line 329 
> > /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
> > [0]PETSC ERROR: [0] SNESSetFromOptions line 869 
> > /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
> > [0]PETSC ERROR: - Error Message 
> > --
> > [0]PETSC ERROR: Floating point exception
> > [0]PETSC ERROR: trapped floating point error
> > [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html 
> > for trouble shooting.
> > [0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1685-gd3eb2e1  GIT 
> > Date: 2019-08-13 06:33:29 -0400
> > [0]PETSC ERROR: ./ex56_single on a arch-summit-dbg-single-pgi-cuda named 
> > h36n11 by adams Wed Aug 14 22:21:56 2019
> > [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpiCC 
> > --with-fc=mpif90 COPTFLAGS="-g -Mfcon" CXXOPTFLAGS="-g -Mfcon" 
> > FOPTFLAGS="-g -Mfcon" --with-precision=single --with-ssl=0 --with-batch=0 
> > --with-mpiexec="jsrun -g 1" --with-cuda=1 --with-cudac=nvcc 
> > CUDAFLAGS="-ccbin pgc++" --download-metis --download-parmetis 
> > --download-fblaslapack --with-x=0 --with-64-bit-indices=0 
> > --with-debugging=1 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-31 Thread Mark Adams via petsc-dev
On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F.  wrote:

>
>   Any explanation for why the scaling is much better for CPUs than
> GPUs? Is it the "extra" time needed for communication from the GPUs?
>

The GPU work is well load balanced so it weak scales perfectly. When you
put that work in the CPU you get more perfectly scalable work added so it
looks better. For instance, the 98K dof/proc data goes up by about 1/2 sec.
from the 1 node to 512 node case for both GPU and CPU, because this
non-scaling is from communication that is the same for both cases
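
As a made-up illustration (these numbers are not from the runs): if the 1-node
GPU solve takes 1.0 sec and the CPU solve takes 4.0 sec, then adding the same
0.5 sec of non-scaling communication at 512 nodes gives an apparent
weak-scaling efficiency of 1.0/1.5 = 67% for the GPU but 4.0/4.5 = 89% for the
CPU, even though the added cost is identical.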


>
>   Perhaps you could try the GPU version with Junchao's new MPI-aware CUDA
> branch (in the gitlab merge requests)  that can speed up the communication
> from GPUs?
>

Sure, do I just check out jczhang/feature-sf-on-gpu and run as usual?


>
>Barry
>
>
> > On Aug 30, 2019, at 11:56 AM, Mark Adams  wrote:
> >
> > Here is some more weak scaling data with a fixed number of iterations (I
> have given a test with the numerical problems to ORNL and they said they
> would give it to Nvidia).
> >
> > I implemented an option to "spread" the reduced coarse grids across the
> whole machine as opposed to a "compact" layout where active processes are
> laid out in simple lexicographical order. This spread approach looks a
> little better.
> >
> > Mark
> >
> > On Wed, Aug 14, 2019 at 10:46 PM Smith, Barry F. 
> wrote:
> >
> >   Ahh, PGI compiler, that explains it :-)
> >
> >   Ok, thanks. Don't worry about the runs right now. We'll figure out the
> fix. The code is just
> >
> >   *a = (PetscReal)strtod(name,endptr);
> >
> >   could be a compiler bug.
> >
> >
> >
> >
> > > On Aug 14, 2019, at 9:23 PM, Mark Adams  wrote:
> > >
> > > I am getting this error with single:
> > >
> > > 22:21  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1
> ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type
> aijcusparse -fp_trap
> > > [0] 81 global equations, 27 vertices
> > > [0]PETSC ERROR: *** unknown floating point error occurred ***
> > > [0]PETSC ERROR: The specific exception can be determined by running in
> a debugger.  When the
> > > [0]PETSC ERROR: debugger traps the signal, the exception can be found
> with fetestexcept(0x3e00)
> > > [0]PETSC ERROR: where the result is a bitwise OR of the following
> flags:
> > > [0]PETSC ERROR: FE_INVALID=0x2000 FE_DIVBYZERO=0x400
> FE_OVERFLOW=0x1000 FE_UNDERFLOW=0x800 FE_INEXACT=0x200
> > > [0]PETSC ERROR: Try option -start_in_debugger
> > > [0]PETSC ERROR: likely location of problem given in stack below
> > > [0]PETSC ERROR: -  Stack Frames
> 
> > > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> > > [0]PETSC ERROR:   INSTEAD the line number of the start of the
> function
> > > [0]PETSC ERROR:   is given.
> > > [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
> > > [0]PETSC ERROR: [0] PetscStrtod line 1964
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > > [0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > > [0]PETSC ERROR: [0] PetscOptionsGetReal line 2321
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > > [0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
> > > [0]PETSC ERROR: [0] KSPSetFromOptions line 329
> /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
> > > [0]PETSC ERROR: [0] SNESSetFromOptions line 869
> /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
> > > [0]PETSC ERROR: - Error Message
> --
> > > [0]PETSC ERROR: Floating point exception
> > > [0]PETSC ERROR: trapped floating point error
> > > [0]PETSC ERROR: See
> https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > > [0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1685-gd3eb2e1
> GIT Date: 2019-08-13 06:33:29 -0400
> > > [0]PETSC ERROR: ./ex56_single on a arch-summit-dbg-single-pgi-cuda
> named h36n11 by adams Wed Aug 14 22:21:56 2019
> > > [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpiCC
> --with-fc=mpif90 COPTFLAGS="-g -Mfcon" CXXOPTFLAGS="-g -Mfcon"
> FOPTFLAGS="-g -Mfcon" --with-precision=single --with-ssl=0 --with-batch=0
> --with-mpiexec="jsrun -g 1" --with-cuda=1 --with-cudac=nvcc
> CUDAFLAGS="-ccbin pgc++" --download-metis --download-parmetis
> --download-fblaslapack --with-x=0 --with-64-bit-indices=0
> --with-debugging=1 PETSC_ARCH=arch-summit-dbg-single-pgi-cuda
> > > [0]PETSC ERROR: #1 User provided function() line 0 in Unknown file
> > >
> --
> > >
> > > On Wed, Aug 14, 2019 at 9:51 PM Smith, Barry F. 
> wrote:
> > >
> > >   Oh, doesn't even have 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-31 Thread Smith, Barry F. via petsc-dev


  Any explanation for why the scaling is much better for CPUs than GPUs? Is 
it the "extra" time needed for communication from the GPUs? 

  Perhaps you could try the GPU version with Junchao's new MPI-aware CUDA 
branch (in the gitlab merge requests)  that can speed up the communication from 
GPUs?

   Barry


> On Aug 30, 2019, at 11:56 AM, Mark Adams  wrote:
> 
> Here is some more weak scaling data with a fixed number of iterations (I have 
> given a test with the numerical problems to ORNL and they said they would 
> give it to Nvidia).
> 
> I implemented an option to "spread" the reduced coarse grids across the whole 
> machine as opposed to a "compact" layout where active processes are laid out 
> in simple lexicographical order. This spread approach looks a little better.
> 
> Mark
> 
> On Wed, Aug 14, 2019 at 10:46 PM Smith, Barry F.  wrote:
> 
>   Ahh, PGI compiler, that explains it :-)
> 
>   Ok, thanks. Don't worry about the runs right now. We'll figure out the fix. 
> The code is just
> 
>   *a = (PetscReal)strtod(name,endptr);
> 
>   could be a compiler bug.
> 
> 
> 
> 
> > On Aug 14, 2019, at 9:23 PM, Mark Adams  wrote:
> > 
> > I am getting this error with single:
> > 
> > 22:21  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1 
> > ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type 
> > aijcusparse -fp_trap 
> > [0] 81 global equations, 27 vertices
> > [0]PETSC ERROR: *** unknown floating point error occurred ***
> > [0]PETSC ERROR: The specific exception can be determined by running in a 
> > debugger.  When the
> > [0]PETSC ERROR: debugger traps the signal, the exception can be found with 
> > fetestexcept(0x3e00)
> > [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
> > [0]PETSC ERROR: FE_INVALID=0x2000 FE_DIVBYZERO=0x400 
> > FE_OVERFLOW=0x1000 FE_UNDERFLOW=0x800 FE_INEXACT=0x200
> > [0]PETSC ERROR: Try option -start_in_debugger
> > [0]PETSC ERROR: likely location of problem given in stack below
> > [0]PETSC ERROR: -  Stack Frames 
> > 
> > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> > [0]PETSC ERROR:   INSTEAD the line number of the start of the function
> > [0]PETSC ERROR:   is given.
> > [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
> > [0]PETSC ERROR: [0] PetscStrtod line 1964 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > [0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > [0]PETSC ERROR: [0] PetscOptionsGetReal line 2321 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > [0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
> > [0]PETSC ERROR: [0] KSPSetFromOptions line 329 
> > /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
> > [0]PETSC ERROR: [0] SNESSetFromOptions line 869 
> > /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
> > [0]PETSC ERROR: - Error Message 
> > --
> > [0]PETSC ERROR: Floating point exception
> > [0]PETSC ERROR: trapped floating point error
> > [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html 
> > for trouble shooting.
> > [0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1685-gd3eb2e1  GIT 
> > Date: 2019-08-13 06:33:29 -0400
> > [0]PETSC ERROR: ./ex56_single on a arch-summit-dbg-single-pgi-cuda named 
> > h36n11 by adams Wed Aug 14 22:21:56 2019
> > [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpiCC 
> > --with-fc=mpif90 COPTFLAGS="-g -Mfcon" CXXOPTFLAGS="-g -Mfcon" 
> > FOPTFLAGS="-g -Mfcon" --with-precision=single --with-ssl=0 --with-batch=0 
> > --with-mpiexec="jsrun -g 1" --with-cuda=1 --with-cudac=nvcc 
> > CUDAFLAGS="-ccbin pgc++" --download-metis --download-parmetis 
> > --download-fblaslapack --with-x=0 --with-64-bit-indices=0 
> > --with-debugging=1 PETSC_ARCH=arch-summit-dbg-single-pgi-cuda
> > [0]PETSC ERROR: #1 User provided function() line 0 in Unknown file
> > --
> > 
> > On Wed, Aug 14, 2019 at 9:51 PM Smith, Barry F.  wrote:
> > 
> >   Oh, doesn't even have to be that large. We just need to be able to look 
> > at the flop rates (as a surrogate for run times) and compare with the 
> > previous runs. So long as the size per process is pretty much the same that 
> > is good enough.
> > 
> >Barry
> > 
> > 
> > > On Aug 14, 2019, at 8:45 PM, Mark Adams  wrote:
> > > 
> > > I can run single, I just can't scale up. But I can use like 1500 
> > > processors.
> > > 
> > > On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F.  
> > > wrote:
> > > 
> > >   Oh, are all your integers 8 bytes? Even on one 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev


  Ahh, PGI compiler, that explains it :-)

  Ok, thanks. Don't worry about the runs right now. We'll figure out the fix. 
The code is just

  *a = (PetscReal)strtod(name,endptr);

  could be a compiler bug.
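
  A standalone sketch of what -fp_trap can react to on that line (plain C with
a made-up input string, not the PETSc source):

    #include <fenv.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* strtod() plus the narrowing cast to float can legitimately raise
       FE_INEXACT (and possibly FE_UNDERFLOW), which a trap handler such as
       the one installed by -fp_trap may then catch. */
    int main(void)
    {
      const char *name = "1.0e-30";   /* hypothetical option value */
      char       *endptr;
      float       a;

      feclearexcept(FE_ALL_EXCEPT);
      a = (float)strtod(name, &endptr);
      printf("a = %g, raised flags = 0x%x\n", (double)a,
             (unsigned)fetestexcept(FE_ALL_EXCEPT));
      return 0;
    }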


  

> On Aug 14, 2019, at 9:23 PM, Mark Adams  wrote:
> 
> I am getting this error with single:
> 
> 22:21  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1 
> ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type 
> aijcusparse -fp_trap 
> [0] 81 global equations, 27 vertices
> [0]PETSC ERROR: *** unknown floating point error occurred ***
> [0]PETSC ERROR: The specific exception can be determined by running in a 
> debugger.  When the
> [0]PETSC ERROR: debugger traps the signal, the exception can be found with 
> fetestexcept(0x3e00)
> [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
> [0]PETSC ERROR: FE_INVALID=0x2000 FE_DIVBYZERO=0x400 
> FE_OVERFLOW=0x1000 FE_UNDERFLOW=0x800 FE_INEXACT=0x200
> [0]PETSC ERROR: Try option -start_in_debugger
> [0]PETSC ERROR: likely location of problem given in stack below
> [0]PETSC ERROR: -  Stack Frames 
> 
> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [0]PETSC ERROR:   INSTEAD the line number of the start of the function
> [0]PETSC ERROR:   is given.
> [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355 
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
> [0]PETSC ERROR: [0] PetscStrtod line 1964 
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> [0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021 
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> [0]PETSC ERROR: [0] PetscOptionsGetReal line 2321 
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> [0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015 
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
> [0]PETSC ERROR: [0] KSPSetFromOptions line 329 
> /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
> [0]PETSC ERROR: [0] SNESSetFromOptions line 869 
> /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
> [0]PETSC ERROR: - Error Message 
> --
> [0]PETSC ERROR: Floating point exception
> [0]PETSC ERROR: trapped floating point error
> [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for 
> trouble shooting.
> [0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1685-gd3eb2e1  GIT 
> Date: 2019-08-13 06:33:29 -0400
> [0]PETSC ERROR: ./ex56_single on a arch-summit-dbg-single-pgi-cuda named 
> h36n11 by adams Wed Aug 14 22:21:56 2019
> [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpiCC 
> --with-fc=mpif90 COPTFLAGS="-g -Mfcon" CXXOPTFLAGS="-g -Mfcon" FOPTFLAGS="-g 
> -Mfcon" --with-precision=single --with-ssl=0 --with-batch=0 
> --with-mpiexec="jsrun -g 1" --with-cuda=1 --with-cudac=nvcc CUDAFLAGS="-ccbin 
> pgc++" --download-metis --download-parmetis --download-fblaslapack --with-x=0 
> --with-64-bit-indices=0 --with-debugging=1 
> PETSC_ARCH=arch-summit-dbg-single-pgi-cuda
> [0]PETSC ERROR: #1 User provided function() line 0 in Unknown file
> --
> 
> On Wed, Aug 14, 2019 at 9:51 PM Smith, Barry F.  wrote:
> 
>   Oh, doesn't even have to be that large. We just need to be able to look at 
> the flop rates (as a surrogate for run times) and compare with the previous 
> runs. So long as the size per process is pretty much the same that is good 
> enough.
> 
>Barry
> 
> 
> > On Aug 14, 2019, at 8:45 PM, Mark Adams  wrote:
> > 
> > I can run single, I just can't scale up. But I can use like 1500 processors.
> > 
> > On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F.  wrote:
> > 
> >   Oh, are all your integers 8 bytes? Even on one node?
> > 
> >   Once Karl's new middleware is in place we should see about reducing to 4 
> > bytes on the GPU.
> > 
> >Barry
> > 
> > 
> > > On Aug 14, 2019, at 7:44 PM, Mark Adams  wrote:
> > > 
> > > OK, I'll run single. It's a bit perverse to run with 4 byte floats and 8 
> > > byte integers ... I could use 32 bit ints and just not scale out.
> > > 
> > > On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F.  
> > > wrote:
> > > 
> > >  Mark,
> > > 
> > >Oh, I don't even care if it converges, just put in a fixed number of 
> > > iterations. The idea is to just get a baseline of the possible 
> > > improvement. 
> > > 
> > > ECP is literally dropping millions into research on "multi precision" 
> > > computations on GPUs, we need to have some actual numbers for the best 
> > > potential benefit to determine how much we invest in further 
> > > investigating it, or not.
> > > 
> > > I am not expressing any opinions on the approach, we are just in the 
> > > fact gathering stage.
> > > 
> > > 
> > >Barry
> > > 
> > > 
> > > 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
I am getting this error with single:

22:21  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1
./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type
aijcusparse -fp_trap
[0] 81 global equations, 27 vertices
[0]PETSC ERROR: *** unknown floating point error occurred ***
[0]PETSC ERROR: The specific exception can be determined by running in a
debugger.  When the
[0]PETSC ERROR: debugger traps the signal, the exception can be found with
fetestexcept(0x3e00)
[0]PETSC ERROR: where the result is a bitwise OR of the following flags:
[0]PETSC ERROR: FE_INVALID=0x2000 FE_DIVBYZERO=0x400
FE_OVERFLOW=0x1000 FE_UNDERFLOW=0x800 FE_INEXACT=0x200
[0]PETSC ERROR: Try option -start_in_debugger
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: -  Stack Frames

[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR:   INSTEAD the line number of the start of the function
[0]PETSC ERROR:   is given.
[0]PETSC ERROR: [0] PetscDefaultFPTrap line 355
/autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
[0]PETSC ERROR: [0] PetscStrtod line 1964
/autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
[0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021
/autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
[0]PETSC ERROR: [0] PetscOptionsGetReal line 2321
/autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
[0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015
/autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
[0]PETSC ERROR: [0] KSPSetFromOptions line 329
/autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
[0]PETSC ERROR: [0] SNESSetFromOptions line 869
/autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
[0]PETSC ERROR: - Error Message
--
[0]PETSC ERROR: Floating point exception
[0]PETSC ERROR: trapped floating point error
[0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html
for trouble shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1685-gd3eb2e1  GIT
Date: 2019-08-13 06:33:29 -0400
[0]PETSC ERROR: ./ex56_single on a arch-summit-dbg-single-pgi-cuda named
h36n11 by adams Wed Aug 14 22:21:56 2019
[0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpiCC
--with-fc=mpif90 COPTFLAGS="-g -Mfcon" CXXOPTFLAGS="-g -Mfcon"
FOPTFLAGS="-g -Mfcon" --with-precision=single --with-ssl=0 --with-batch=0
--with-mpiexec="jsrun -g 1" --with-cuda=1 --with-cudac=nvcc
CUDAFLAGS="-ccbin pgc++" --download-metis --download-parmetis
--download-fblaslapack --with-x=0 --with-64-bit-indices=0
--with-debugging=1 PETSC_ARCH=arch-summit-dbg-single-pgi-cuda
[0]PETSC ERROR: #1 User provided function() line 0 in Unknown file
--

On Wed, Aug 14, 2019 at 9:51 PM Smith, Barry F.  wrote:

>
>   Oh, doesn't even have to be that large. We just need to be able to look
> at the flop rates (as a surrogate for run times) and compare with the
> previous runs. So long as the size per process is pretty much the same that
> is good enough.
>
>Barry
>
>
> > On Aug 14, 2019, at 8:45 PM, Mark Adams  wrote:
> >
> > I can run single, I just can't scale up. But I can use like 1500
> processors.
> >
> > On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F. 
> wrote:
> >
> >   Oh, are all your integers 8 bytes? Even on one node?
> >
> >   Once Karl's new middleware is in place we should see about reducing to
> 4 bytes on the GPU.
> >
> >Barry
> >
> >
> > > On Aug 14, 2019, at 7:44 PM, Mark Adams  wrote:
> > >
> > > OK, I'll run single. It's a bit perverse to run with 4 byte floats and 8
> byte integers ... I could use 32 bit ints and just not scale out.
> > >
> > > On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F. 
> wrote:
> > >
> > >  Mark,
> > >
> > >Oh, I don't even care if it converges, just put in a fixed number
> of iterations. The idea is to just get a baseline of the possible
> improvement.
> > >
> > > ECP is literally dropping millions into research on "multi
> precision" computations on GPUs, we need to have some actual numbers for
> the best potential benefit to determine how much we invest in further
> investigating it, or not.
> > >
> > > I am not expressing any opinions on the approach, we are just in
> the fact gathering stage.
> > >
> > >
> > >Barry
> > >
> > >
> > > > On Aug 14, 2019, at 2:27 PM, Mark Adams  wrote:
> > > >
> > > >
> > > >
> > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. 
> wrote:
> > > >
> > > >   Mark,
> > > >
> > > >Would you be able to make one run using single precision? Just
> single everywhere since that is all we support currently?
> > > >
> > > >
> > > > Experience in engineering at least is single does not work for FE
> elasticity. I have tried it many years ago and have heard this from 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev


  Oh, doesn't even have to be that large. We just need to be able to look at 
the flop rates (as a surrogate for run times) and compare with the previous 
runs. So long as the size per process is pretty much the same that is good 
enough.

   Barry


> On Aug 14, 2019, at 8:45 PM, Mark Adams  wrote:
> 
> I can run single, I just can't scale up. But I can use like 1500 processors.
> 
> On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F.  wrote:
> 
>   Oh, are all your integers 8 bytes? Even on one node?
> 
>   Once Karl's new middleware is in place we should see about reducing to 4 
> bytes on the GPU.
> 
>Barry
> 
> 
> > On Aug 14, 2019, at 7:44 PM, Mark Adams  wrote:
> > 
> > OK, I'll run single. It's a bit perverse to run with 4 byte floats and 8 byte 
> > integers ... I could use 32 bit ints and just not scale out.
> > 
> > On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F.  wrote:
> > 
> >  Mark,
> > 
> >Oh, I don't even care if it converges, just put in a fixed number of 
> > iterations. The idea is to just get a baseline of the possible improvement. 
> > 
> > ECP is literally dropping millions into research on "multi precision" 
> > computations on GPUs, we need to have some actual numbers for the best 
> > potential benefit to determine how much we invest in further investigating 
> > it, or not.
> > 
> > I am not expressing any opinions on the approach, we are just in the 
> > fact gathering stage.
> > 
> > 
> >Barry
> > 
> > 
> > > On Aug 14, 2019, at 2:27 PM, Mark Adams  wrote:
> > > 
> > > 
> > > 
> > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F.  
> > > wrote:
> > > 
> > >   Mark,
> > > 
> > >Would you be able to make one run using single precision? Just single 
> > > everywhere since that is all we support currently? 
> > > 
> > > 
> > > Experience in engineering at least is single does not work for FE 
> > > elasticity. I have tried it many years ago and have heard this from 
> > > others. This problem is pretty simple other than using Q2. I suppose I 
> > > could try it, but just be aware the FE people might say that single sucks.
> > >  
> > >The results will give us motivation (or anti-motivation) to have 
> > > support for running KSP (or PC (or Mat)  in single precision while the 
> > > simulation is double.
> > > 
> > >Thanks.
> > > 
> > >  Barry
> > > 
> > > For example if the GPU speed on KSP is a factor of 3 over the double on 
> > > GPUs this is serious motivation. 
> > > 
> > > 
> > > > On Aug 14, 2019, at 12:45 PM, Mark Adams  wrote:
> > > > 
> > > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU 
> > > > speedup with 98K dof/proc (3D Q2 elasticity).
> > > > 
> > > > This is weak scaling of a solve. There is growth in iteration count 
> > > > folded in here. I should put rtol in the title and/or run a fixed 
> > > > number of iterations and make it clear in the title.
> > > > 
> > > > Comments welcome.
> > > > 
> > > 
> > 
> 



Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
I can run single, I just can't scale up. But I can use like 1500 processors.

On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F.  wrote:

>
>   Oh, are all your integers 8 bytes? Even on one node?
>
>   Once Karl's new middleware is in place we should see about reducing to 4
> bytes on the GPU.
>
>Barry
>
>
> > On Aug 14, 2019, at 7:44 PM, Mark Adams  wrote:
> >
> > OK, I'll run single. It's a bit perverse to run with 4 byte floats and 8
> byte integers ... I could use 32 bit ints and just not scale out.
> >
> > On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F. 
> wrote:
> >
> >  Mark,
> >
> >Oh, I don't even care if it converges, just put in a fixed number of
> iterations. The idea is to just get a baseline of the possible improvement.
> >
> > ECP is literally dropping millions into research on "multi
> precision" computations on GPUs, we need to have some actual numbers for
> the best potential benefit to determine how much we invest in further
> investigating it, or not.
> >
> > I am not expressing any opinions on the approach, we are just in the
> fact gathering stage.
> >
> >
> >Barry
> >
> >
> > > On Aug 14, 2019, at 2:27 PM, Mark Adams  wrote:
> > >
> > >
> > >
> > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. 
> wrote:
> > >
> > >   Mark,
> > >
> > >Would you be able to make one run using single precision? Just
> single everywhere since that is all we support currently?
> > >
> > >
> > > Experience in engineering at least is single does not work for FE
> elasticity. I have tried it many years ago and have heard this from others.
> This problem is pretty simple other than using Q2. I suppose I could try
> it, but just be aware the FE people might say that single sucks.
> > >
> > >The results will give us motivation (or anti-motivation) to have
> support for running KSP (or PC (or Mat)  in single precision while the
> simulation is double.
> > >
> > >Thanks.
> > >
> > >  Barry
> > >
> > > For example if the GPU speed on KSP is a factor of 3 over the double
> on GPUs this is serious motivation.
> > >
> > >
> > > > On Aug 14, 2019, at 12:45 PM, Mark Adams  wrote:
> > > >
> > > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x
> GPU speedup with 98K dof/proc (3D Q2 elasticity).
> > > >
> > > > This is weak scaling of a solve. There is growth in iteration count
> folded in here. I should put rtol in the title and/or run a fixed number of
> iterations and make it clear in the title.
> > > >
> > > > Comments welcome.
> > > >
> 
> > >
> >
>
>


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev


  Oh, are all your integers 8 bytes? Even on one node?

  Once Karl's new middleware is in place we should see about reducing to 4 
bytes on the GPU.
   
   Barry


> On Aug 14, 2019, at 7:44 PM, Mark Adams  wrote:
> 
> OK, I'll run single. It's a bit perverse to run with 4 byte floats and 8 byte 
> integers ... I could use 32 bit ints and just not scale out.
> 
> On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F.  wrote:
> 
>  Mark,
> 
>Oh, I don't even care if it converges, just put in a fixed number of 
> iterations. The idea is to just get a baseline of the possible improvement. 
> 
> ECP is literally dropping millions into research on "multi precision" 
> computations on GPUs, we need to have some actual numbers for the best 
> potential benefit to determine how much we invest in further investigating 
> it, or not.
> 
> I am not expressing any opinions on the approach, we are just in the fact 
> gathering stage.
> 
> 
>Barry
> 
> 
> > On Aug 14, 2019, at 2:27 PM, Mark Adams  wrote:
> > 
> > 
> > 
> > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F.  wrote:
> > 
> >   Mark,
> > 
> >Would you be able to make one run using single precision? Just single 
> > everywhere since that is all we support currently? 
> > 
> > 
> > Experience in engineering at least is single does not work for FE 
> > elasticity. I have tried it many years ago and have heard this from others. 
> > This problem is pretty simple other than using Q2. I suppose I could try 
> > it, but just be aware the FE people might say that single sucks.
> >  
> >The results will give us motivation (or anti-motivation) to have support 
> > for running KSP (or PC (or Mat)  in single precision while the simulation 
> > is double.
> > 
> >Thanks.
> > 
> >  Barry
> > 
> > For example if the GPU speed on KSP is a factor of 3 over the double on 
> > GPUs this is serious motivation. 
> > 
> > 
> > > On Aug 14, 2019, at 12:45 PM, Mark Adams  wrote:
> > > 
> > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU 
> > > speedup with 98K dof/proc (3D Q2 elasticity).
> > > 
> > > This is weak scaling of a solve. There is growth in iteration count 
> > > folded in here. I should put rtol in the title and/or run a fixed number 
> > > of iterations and make it clear in the title.
> > > 
> > > Comments welcome.
> > > 
> > 
> 



Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
OK, I'll run single. It's a bit perverse to run with 4 byte floats and 8 byte
integers ... I could use 32 bit ints and just not scale out.
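
If I go that route, the relevant configure flags are just (a sketch; the rest
of the options stay the same as in my current arch-summit builds):

    ./configure --with-precision=single --with-64-bit-indices=0 <other options as before>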

On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F.  wrote:

>
>  Mark,
>
>Oh, I don't even care if it converges, just put in a fixed number of
> iterations. The idea is to just get a baseline of the possible improvement.
>
> ECP is literally dropping millions into research on "multi precision"
> computations on GPUs, we need to have some actual numbers for the best
> potential benefit to determine how much we invest in further investigating
> it, or not.
>
> I am not expressing any opinions on the approach, we are just in the
> fact gathering stage.
>
>
>Barry
>
>
> > On Aug 14, 2019, at 2:27 PM, Mark Adams  wrote:
> >
> >
> >
> > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. 
> wrote:
> >
> >   Mark,
> >
> >Would you be able to make one run using single precision? Just single
> everywhere since that is all we support currently?
> >
> >
> > Experience in engineering at least is single does not work for FE
> elasticity. I have tried it many years ago and have heard this from others.
> This problem is pretty simple other than using Q2. I suppose I could try
> it, but just be aware the FE people might say that single sucks.
> >
> >The results will give us motivation (or anti-motivation) to have
> support for running KSP (or PC (or Mat)  in single precision while the
> simulation is double.
> >
> >Thanks.
> >
> >  Barry
> >
> > For example if the GPU speed on KSP is a factor of 3 over the double on
> GPUs this is serious motivation.
> >
> >
> > > On Aug 14, 2019, at 12:45 PM, Mark Adams  wrote:
> > >
> > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU
> speedup with 98K dof/proc (3D Q2 elasticity).
> > >
> > > This is weak scaling of a solve. There is growth in iteration count
> folded in here. I should put rtol in the title and/or run a fixed number of
> iterations and make it clear in the title.
> > >
> > > Comments welcome.
> > >
> 
> >
>
>


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
FYI, this test has a smooth (polynomial) body force and it runs a
convergence study.

On Wed, Aug 14, 2019 at 6:15 PM Brad Aagaard via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> Q2 is often useful in problems with body forces (such as gravitational
> body forces), which tend to have linear variations in stress.
>
> On 8/14/19 2:51 PM, Mark Adams via petsc-dev wrote:
> >
> >
> > Do you have any applications that specifically want Q2 (versus Q1)
> > elasticity or have some test problems that would benefit?
> >
> >
> > No, I'm just trying to push things.
>


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Jed Brown via petsc-dev
"Smith, Barry F."  writes:

>> On Aug 14, 2019, at 5:58 PM, Jed Brown  wrote:
>> 
>> "Smith, Barry F."  writes:
>> 
 On Aug 14, 2019, at 2:37 PM, Jed Brown  wrote:
 
 Mark Adams via petsc-dev  writes:
 
> On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F.  
> wrote:
> 
>> 
>> Mark,
>> 
>>  Would you be able to make one run using single precision? Just single
>> everywhere since that is all we support currently?
>> 
>> 
> Experience in engineering at least is single does not work for FE
> elasticity. I have tried it many years ago and have heard this from 
> others.
> This problem is pretty simple other than using Q2. I suppose I could try
> it, but just be aware the FE people might say that single sucks.
 
 When they say that single sucks, is it for the definition of the
 operator or the preconditioner?
 
 As point of reference, we can apply Q2 elasticity operators in double
 precision at nearly a billion dofs/second per GPU.
>>> 
>>>  And in single you get what?
>> 
>> I don't have exact numbers, but <2x faster on V100, and it sort of
>> doesn't matter because preconditioning cost will dominate.  
>
>When using block formats a much higher percentage of the bandwidth goes to 
> moving the double precision matrix entries so switching to single could 
> conceivably benefit by up to almost a factor of two. 
>
> Depending on the matrix structure perhaps the column indices could be 
> handled by a shift and short j indices. Or 2 shifts and 2 sets of j indices

Shorts are a problem, but a lot of matrices are actually quite
compressible if you subtract the row index from all the column indices.  I've
done some experiments using zstd, and the CPU decode rate is competitive
with or better than DRAM bandwidth.  But that gives up random access, which
seems important for vectorization.  Maybe someone who knows more about
decompression on GPUs can comment?

>> The big win
>> of single is on consumer-grade GPUs, which DOE doesn't install and
>> which NVIDIA forbids from being used in data centers (because they're so
>> cost-effective ;-)).
>
>DOE LCFs are not our only customers. Cheap-o engineering professors
>might stack a bunch of consumer grade in their lab, would they
>benefit? Satish's basement could hold a great deal of consumer
>grades.

Fair point.  Time is also important so most companies buy the more
expensive hardware on the assumption it means less frequent problems
(due to lack of ECC, etc.).


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev



> On Aug 14, 2019, at 5:58 PM, Jed Brown  wrote:
> 
> "Smith, Barry F."  writes:
> 
>>> On Aug 14, 2019, at 2:37 PM, Jed Brown  wrote:
>>> 
>>> Mark Adams via petsc-dev  writes:
>>> 
 On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F.  wrote:
 
> 
> Mark,
> 
>  Would you be able to make one run using single precision? Just single
> everywhere since that is all we support currently?
> 
> 
 Experience in engineering at least is single does not work for FE
 elasticity. I have tried it many years ago and have heard this from others.
 This problem is pretty simple other than using Q2. I suppose I could try
 it, but just be aware the FE people might say that single sucks.
>>> 
>>> When they say that single sucks, is it for the definition of the
>>> operator or the preconditioner?
>>> 
>>> As point of reference, we can apply Q2 elasticity operators in double
>>> precision at nearly a billion dofs/second per GPU.
>> 
>>  And in single you get what?
> 
> I don't have exact numbers, but <2x faster on V100, and it sort of
> doesn't matter because preconditioning cost will dominate.  

   When using block formats a much higher percentage of the bandwidth goes to 
moving the double precision matrix entries so switching to single could 
conceivably benefit by up to almost a factor of two. 

Depending on the matrix structure perhaps the column indices could be 
handled by a shift and short j indices. Or 2 shifts and 2 sets of j indices
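
A rough sketch of the single-shift variant (illustrative only, not an actual
PETSc matrix format): each row stores a base column and 16-bit offsets from
it, which roughly halves the index bandwidth as long as every column in a row
stays within 65535 of the base.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
      size_t          nrows;
      const size_t   *rowptr;  /* CSR row pointers, length nrows+1           */
      const size_t   *base;    /* per-row base column (the "shift")          */
      const uint16_t *joff;    /* short column offsets, length rowptr[nrows] */
      const double   *val;     /* matrix entries                             */
    } ShortCSR;

    /* y = A*x using the shifted short column indices */
    static void spmv_short(const ShortCSR *A, const double *x, double *y)
    {
      for (size_t i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (size_t k = A->rowptr[i]; k < A->rowptr[i+1]; k++)
          sum += A->val[k] * x[A->base[i] + A->joff[k]];
        y[i] = sum;
      }
    }

A real format would need a fallback (or the "2 shifts" variant) for rows whose
column span exceeds 16 bits.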

> The big win
> of single is on consumer-grade GPUs, which DOE doesn't install and
> which NVIDIA forbids from being used in data centers (because they're so
> cost-effective ;-)).

   DOE LCFs are not our only customers. Cheap-o engineering professors might 
stack a bunch of consumer grade in their lab, would they benefit? Satish's 
basement could hold a great deal of consumer grades.

> 
>>> I'm skeptical of big wins in preconditioning (especially setup) due to
>>> the cost and irregularity of indexing being large compared to the
>>> bandwidth cost of the floating point values.



Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev



> On Aug 14, 2019, at 3:36 PM, Mark Adams  wrote:
> 
> 
> 
> On Wed, Aug 14, 2019 at 3:37 PM Jed Brown  wrote:
> Mark Adams via petsc-dev  writes:
> 
> > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F.  wrote:
> >
> >>
> >>   Mark,
> >>
> >>Would you be able to make one run using single precision? Just single
> >> everywhere since that is all we support currently?
> >>
> >>
> > Experience in engineering at least is single does not work for FE
> > elasticity. I have tried it many years ago and have heard this from others.
> > This problem is pretty simple other than using Q2. I suppose I could try
> > it, but just be aware the FE people might say that single sucks.
> 
> When they say that single sucks, is it for the definition of the
> operator or the preconditioner?
> 
> Operator.
> 
> And "ve seen GMRES stagnate when using single in communication in parallel 
> Gauss-Seidel. Roundoff is nonlinear.

   When it is specific places in the algorithm that require more precision, it 
can potentially be added there. For example, compute reductions in double, or 
even the "delicate" parts of the function/Jacobian evaluation. Is it worth the 
bother? Apparently it is for the people with suitcases of money to hand out.
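
   A minimal sketch of the reductions-in-double idea (not PETSc code): keep the
vector data in single precision but accumulate the dot product in double.

    #include <stddef.h>

    /* single-precision data, double-precision accumulator */
    static double dot_single_data_double_sum(const float *x, const float *y,
                                             size_t n)
    {
      double sum = 0.0;
      for (size_t i = 0; i < n; i++) sum += (double)x[i] * (double)y[i];
      return sum;
    }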


   
>  
> 
> As point of reference, we can apply Q2 elasticity operators in double
> precision at nearly a billion dofs/second per GPU. 
> 
> I'm skeptical of big wins in preconditioning (especially setup) due to
> the cost and irregularity of indexing being large compared to the
> bandwidth cost of the floating point values.



Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Jed Brown via petsc-dev
"Smith, Barry F."  writes:

>> On Aug 14, 2019, at 2:37 PM, Jed Brown  wrote:
>> 
>> Mark Adams via petsc-dev  writes:
>> 
>>> On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F.  wrote:
>>> 
 
  Mark,
 
   Would you be able to make one run using single precision? Just single
 everywhere since that is all we support currently?
 
 
>>> Experience in engineering at least is single does not work for FE
>>> elasticity. I have tried it many years ago and have heard this from others.
>>> This problem is pretty simple other than using Q2. I suppose I could try
>>> it, but just be aware the FE people might say that single sucks.
>> 
>> When they say that single sucks, is it for the definition of the
>> operator or the preconditioner?
>> 
>> As point of reference, we can apply Q2 elasticity operators in double
>> precision at nearly a billion dofs/second per GPU.
>
>   And in single you get what?

I don't have exact numbers, but <2x faster on V100, and it sort of
doesn't matter because preconditioning cost will dominate.  The big win
of single is on consumer-grade GPUs, which DOE doesn't install and
which NVIDIA forbids from being used in data centers (because they're so
cost-effective ;-)).

>> I'm skeptical of big wins in preconditioning (especially setup) due to
>> the cost and irregularity of indexing being large compared to the
>> bandwidth cost of the floating point values.


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev



> On Aug 14, 2019, at 2:37 PM, Jed Brown  wrote:
> 
> Mark Adams via petsc-dev  writes:
> 
>> On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F.  wrote:
>> 
>>> 
>>>  Mark,
>>> 
>>>   Would you be able to make one run using single precision? Just single
>>> everywhere since that is all we support currently?
>>> 
>>> 
>> Experience in engineering at least is single does not work for FE
>> elasticity. I have tried it many years ago and have heard this from others.
>> This problem is pretty simple other than using Q2. I suppose I could try
>> it, but just be aware the FE people might say that single sucks.
> 
> When they say that single sucks, is it for the definition of the
> operator or the preconditioner?
> 
> As point of reference, we can apply Q2 elasticity operators in double
> precision at nearly a billion dofs/second per GPU.

  And in single you get what?

> 
> I'm skeptical of big wins in preconditioning (especially setup) due to
> the cost and irregularity of indexing being large compared to the
> bandwidth cost of the floating point values.



Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev


 Mark,

   Oh, I don't even care if it converges, just put in a fixed number of 
iterations. The idea is to just get a baseline of the possible improvement. 

ECP is literally dropping millions into research on "multi precision" 
computations on GPUs, we need to have some actual numbers for the best 
potential benefit to determine how much we invest in further investigating it, 
or not.

I am not expressing any opinions on the approach, we are just in the fact 
gathering stage.


   Barry


> On Aug 14, 2019, at 2:27 PM, Mark Adams  wrote:
> 
> 
> 
> On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F.  wrote:
> 
>   Mark,
> 
>Would you be able to make one run using single precision? Just single 
> everywhere since that is all we support currently? 
> 
> 
> Experience in engineering at least is single does not work for FE elasticity. 
> I have tried it many years ago and have heard this from others. This problem 
> is pretty simple other than using Q2. I suppose I could try it, but just be 
> aware the FE people might say that single sucks.
>  
>The results will give us motivation (or anti-motivation) to have support 
> for running KSP (or PC (or Mat)  in single precision while the simulation is 
> double.
> 
>Thanks.
> 
>  Barry
> 
> For example if the GPU speed on KSP is a factor of 3 over the double on GPUs 
> this is serious motivation. 
> 
> 
> > On Aug 14, 2019, at 12:45 PM, Mark Adams  wrote:
> > 
> > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU 
> > speedup with 98K dof/proc (3D Q2 elasticity).
> > 
> > This is weak scaling of a solve. There is growth in iteration count folded 
> > in here. I should put rtol in the title and/or run a fixed number of 
> > iterations and make it clear in the title.
> > 
> > Comments welcome.
> > 
> 



Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
Here are the times for KSPSolve on one node with 2,280,285 equations. These
nodes seem to have 42 cores. There are 6 "devices" (GPUs) with 7 cores
attached to each device. The anomalous 28-core result could be from only
using 4 "devices".  I figure I will use 36 cores for now. I should really
do this with a lot of processors to include MPI communication...

NP   KSPSolve
20   5.6634e+00
24   4.7382e+00
28   6.0349e+00
32   4.7543e+00
36   4.2574e+00
42   4.2022e+00


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Jed Brown via petsc-dev
Brad Aagaard via petsc-dev  writes:

> Q2 is often useful in problems with body forces (such as gravitational 
> body forces), which tend to have linear variations in stress.

It's similar on the free-surface Stokes side, where pressure has a
linear gradient and must be paired with a stable velocity space.

Regarding elasticity, it would be useful to collect some application
problems where Q2 shows a big advantage.

We should be able to solve Q2 at the same or lower cost per dof than Q1
(multigrid for this case isn't off-the-shelf at present, but it's
something we're working on).

> On 8/14/19 2:51 PM, Mark Adams via petsc-dev wrote:
>> 
>> 
>> Do you have any applications that specifically want Q2 (versus Q1)
>> elasticity or have some test problems that would benefit?
>> 
>> 
>> No, I'm just trying to push things.


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Brad Aagaard via petsc-dev
Q2 is often useful in problems with body forces (such as gravitational 
body forces), which tend to have linear variations in stress.


On 8/14/19 2:51 PM, Mark Adams via petsc-dev wrote:



Do you have any applications that specifically want Q2 (versus Q1)
elasticity or have some test problems that would benefit?


No, I'm just trying to push things.


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
>
>
>
> Do you have any applications that specifically want Q2 (versus Q1)
> elasticity or have some test problems that would benefit?
>
>
No, I'm just trying to push things.


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Jed Brown via petsc-dev
Mark Adams  writes:

> On Wed, Aug 14, 2019 at 3:37 PM Jed Brown  wrote:
>
>> Mark Adams via petsc-dev  writes:
>>
>> > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. 
>> wrote:
>> >
>> >>
>> >>   Mark,
>> >>
>> >>Would you be able to make one run using single precision? Just single
>> >> everywhere since that is all we support currently?
>> >>
>> >>
>> > Experience in engineering at least is single does not work for FE
>> > elasticity. I have tried it many years ago and have heard this from
>> others.
>> > This problem is pretty simple other than using Q2. I suppose I could try
>> > it, but just be aware the FE people might say that single sucks.
>>
>> When they say that single sucks, is it for the definition of the
>> operator or the preconditioner?
>>
>
> Operator.
>
> And "ve seen GMRES stagnate when using single in communication in parallel
> Gauss-Seidel. Roundoff is nonlinear.

Fair; single may still be useful in the preconditioner while using
double for operator and Krylov.
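
A conceptual sketch of that split (plain C, not PETSc API; the dense P^{-1}
apply is purely illustrative): demote the double-precision residual, apply the
single-precision preconditioner, and promote the correction back to double.

#include <stdlib.h>

static void pc_apply_single(const float *Pinv,const double *r,double *z,int n)
{
  float *rs = (float*)malloc((size_t)n*sizeof(float));
  float *zs = (float*)calloc((size_t)n,sizeof(float));
  int    i,j;
  for (i=0; i<n; i++) rs[i] = (float)r[i];              /* demote residual to single */
  for (i=0; i<n; i++)
    for (j=0; j<n; j++) zs[i] += Pinv[i*n+j]*rs[j];     /* z_s = P^{-1} r_s in single */
  for (i=0; i<n; i++) z[i] = (double)zs[i];             /* promote correction to double */
  free(rs); free(zs);
}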

Do you have any applications that specifically want Q2 (versus Q1)
elasticity or have some test problems that would benefit?

>> As point of reference, we can apply Q2 elasticity operators in double
>> precision at nearly a billion dofs/second per GPU.
>
>
>> I'm skeptical of big wins in preconditioning (especially setup) due to
>> the cost and irregularity of indexing being large compared to the
>> bandwidth cost of the floating point values.
>>


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
On Wed, Aug 14, 2019 at 2:19 PM Smith, Barry F.  wrote:

>
>   Mark,
>
> This is great, we can study these for months.
>
> 1) At the top of the plots you say SNES  but that can't be right, there is
> no way it is getting such speed ups for the entire SNES solve since the
> Jacobians are CPUs and take much of the time. Do you mean the KSP part of
> the SNES solve?
>

It uses KSPONLY. And solve times are KSPSolve with KSPSetUp called before.
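
A minimal sketch of that timing setup (assuming a KSP ksp with operators A,
right-hand side b, and solution x already created; the log-stage name is
illustrative): KSPSetUp() is called before the timed region so the reported
KSPSolve time excludes the GAMG setup.

  PetscLogStage stage;
  ierr = PetscLogStageRegister("KSPSolve only",&stage);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp,A,A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSetUp(ksp);CHKERRQ(ierr);            /* setup happens here, outside the timed stage */
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);        /* only the solve is attributed to the stage */
  ierr = PetscLogStagePop();CHKERRQ(ierr);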


>
> 2) For the case of a bit more than 1000 processes the speedup with GPUs is
> fantastic, more than 6?
>

I did not see that one, but it is plausible and there is some noise in this
data. The largest solve had a speedup of about 4x.


>
> 3) People will ask about runs using all 48 CPUs, since they are there it
> is a little unfair to only compare 24 CPUs with the GPUs. Presumably due to
> memory bandwidth limits 48 won't be much better than 24 but you need it in
> your back pocket for completeness.
>
>
Ah, good point. I just cut and pasted, but I should run a little test and see
where it saturates.


> 4) From the table
>
> KSPSolve   1 1.0 5.4191e-02 1.0 9.35e+06 7.3 1.3e+04 5.6e+02
> 8.3e+01  0  0  4  0  3  10 57 97 52 81  19113494114 3.06e-01  129
> 1.38e-01 84
> PCApply   17 1.0 4.5053e-02 1.0 9.22e+06 8.5 1.1e+04 5.6e+02
> 3.4e+01  0  0  3  0  1   8 49 81 44 33  19684007 98 2.58e-01  113
> 1.19e-01 81
>
> only 84 percent of the total flops in the KSPSolve are on the GPU and only
> 81 for the PCApply() where are the rest? MatMult() etc are doing 100
> percent on the GPU, MatSolve on the coarsest level should be tiny and not
> taking 19 percent of the flops?
>
>
That is the smallest test, with 3465 equations on 24 cores. The R and P and the
coarse grid are on the CPU. Look at the larger tests.


>   Thanks
>
>Barry
>
>
> > On Aug 14, 2019, at 12:45 PM, Mark Adams  wrote:
> >
> > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU
> speedup with 98K dof/proc (3D Q2 elasticity).
> >
> > This is weak scaling of a solve. There is growth in iteration count
> folded in here. I should put rtol in the title and/or run a fixed number of
> iterations and make it clear in the title.
> >
> > Comments welcome.
> >
> 
>
>


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
On Wed, Aug 14, 2019 at 3:37 PM Jed Brown  wrote:

> Mark Adams via petsc-dev  writes:
>
> > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. 
> wrote:
> >
> >>
> >>   Mark,
> >>
> >>Would you be able to make one run using single precision? Just single
> >> everywhere since that is all we support currently?
> >>
> >>
> > Experience in engineering at least is single does not work for FE
> > elasticity. I have tried it many years ago and have heard this from
> others.
> > This problem is pretty simple other than using Q2. I suppose I could try
> > it, but just be aware the FE people might say that single sucks.
>
> When they say that single sucks, is it for the definition of the
> operator or the preconditioner?
>

Operator.

And "ve seen GMRES stagnate when using single in communication in parallel
Gauss-Seidel. Roundoff is nonlinear.


>
> As point of reference, we can apply Q2 elasticity operators in double
> precision at nearly a billion dofs/second per GPU.


> I'm skeptical of big wins in preconditioning (especially setup) due to
> the cost and irregularity of indexing being large compared to the
> bandwidth cost of the floating point values.
>


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Jed Brown via petsc-dev
Mark Adams via petsc-dev  writes:

> On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F.  wrote:
>
>>
>>   Mark,
>>
>>Would you be able to make one run using single precision? Just single
>> everywhere since that is all we support currently?
>>
>>
> Experience in engineering at least is single does not work for FE
> elasticity. I have tried it many years ago and have heard this from others.
> This problem is pretty simple other than using Q2. I suppose I could try
> it, but just be aware the FE people might say that single sucks.

When they say that single sucks, is it for the definition of the
operator or the preconditioner?

As point of reference, we can apply Q2 elasticity operators in double
precision at nearly a billion dofs/second per GPU.

I'm skeptical of big wins in preconditioning (especially setup) due to
the cost and irregularity of indexing being large compared to the
bandwidth cost of the floating point values.


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F.  wrote:

>
>   Mark,
>
>Would you be able to make one run using single precision? Just single
> everywhere since that is all we support currently?
>
>
Experience in engineering, at least, is that single does not work for FE
elasticity. I tried it many years ago and have heard the same from others.
This problem is pretty simple other than using Q2. I suppose I could try
it, but just be aware the FE people might say that single sucks.


>The results will give us motivation (or anti-motivation) to have
> support for running KSP (or PC (or Mat)  in single precision while the
> simulation is double.
>
>Thanks.
>
>  Barry
>
> For example if the GPU speed on KSP is a factor of 3 over the double on
> GPUs this is serious motivation.
>
>
> > On Aug 14, 2019, at 12:45 PM, Mark Adams  wrote:
> >
> > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU
> speedup with 98K dof/proc (3D Q2 elasticity).
> >
> > This is weak scaling of a solve. There is growth in iteration count
> folded in here. I should put rtol in the title and/or run a fixed number of
> iterations and make it clear in the title.
> >
> > Comments welcome.
> >
> 
>
>


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev


  Mark,

   Would you be able to make one run using single precision? Just single 
everywhere since that is all we support currently? 

   The results will give us motivation (or anti-motivation) to have support for 
running KSP (or PC, or Mat) in single precision while the simulation is double.

   Thanks.

 Barry

For example, if single-precision KSP on the GPU is a factor of 3 faster than double 
on the GPU, that is serious motivation. 


> On Aug 14, 2019, at 12:45 PM, Mark Adams  wrote:
> 
> FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU 
> speedup with 98K dof/proc (3D Q2 elasticity).
> 
> This is weak scaling of a solve. There is growth in iteration count folded in 
> here. I should put rtol in the title and/or run a fixed number of iterations 
> and make it clear in the title.
> 
> Comments welcome.
> 



Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev


  Mark,

This is great, we can study these for months. 

1) At the top of the plots you say SNES, but that can't be right; there is no 
way it is getting such speedups for the entire SNES solve, since the Jacobians 
are computed on the CPUs and take much of the time. Do you mean the KSP part of 
the SNES solve? 

2) For the case of a bit more than 1000 processes the speedup with GPUs is 
fantastic, more than 6?

3) People will ask about runs using all 48 CPUs; since they are there, it is a 
little unfair to compare only 24 CPUs with the GPUs. Presumably, due to memory 
bandwidth limits, 48 won't be much better than 24, but you need it in your back 
pocket for completeness.

4) From the table

KSPSolve   1 1.0 5.4191e-02 1.0 9.35e+06 7.3 1.3e+04 5.6e+02 
8.3e+01  0  0  4  0  3  10 57 97 52 81  19113494114 3.06e-01  129 
1.38e-01 84
PCApply   17 1.0 4.5053e-02 1.0 9.22e+06 8.5 1.1e+04 5.6e+02 
3.4e+01  0  0  3  0  1   8 49 81 44 33  19684007 98 2.58e-01  113 
1.19e-01 81

only 84 percent of the total flops in the KSPSolve are on the GPU, and only 81 
percent for the PCApply(). Where are the rest? MatMult() etc. are doing 100 percent 
on the GPU, and MatSolve on the coarsest level should be tiny, so it should not be 
taking 19 percent of the flops?

  Thanks

   Barry


> On Aug 14, 2019, at 12:45 PM, Mark Adams  wrote:
> 
> FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU 
> speedup with 98K dof/proc (3D Q2 elasticity).
> 
> This is weak scaling of a solve. There is growth in iteration count folded in 
> here. I should put rtol in the title and/or run a fixed number of iterations 
> and make it clear in the title.
> 
> Comments welcome.
> 



Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Mark Adams via petsc-dev
>
>
> 3) Is comparison between pointers appropriate? For example if (dptr !=
> zarray) { is scary if some arrays are zero length how do we know what the
> pointer value will be?
>
>
Yes, you need to consider these cases, which is kind of error-prone.

Also, I think merging the transpose and non-transpose paths is a good idea
because, the way the code is set up, it is easy. You just grab a different
cached object and keep your rmaps and cmaps straight, I think.
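
A rough sketch of that dispatch, along the lines of the MatMultKernel_Private
idea from the thread (illustrative only; the actual PETSc internals and the
cached cuSPARSE objects, matstruct vs. matstructT, differ from this):

static PetscErrorCode MatMultKernel_Sketch(Mat A,Vec xx,Vec zz,PetscBool trans)
{
  PetscErrorCode ierr;
  PetscInt       nin,nout,nt;

  PetscFunctionBegin;
  /* keep the rmaps and cmaps straight: the transpose swaps input and output lengths */
  nin  = trans ? A->rmap->n : A->cmap->n;
  nout = trans ? A->cmap->n : A->rmap->n;
  ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
  if (nt != nin)  SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A (%D) and xx (%D)",nin,nt);
  ierr = VecGetLocalSize(zz,&nt);CHKERRQ(ierr);
  if (nt != nout) SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A (%D) and zz (%D)",nout,nt);
  /* grab the cached cuSPARSE object (matstructT when trans, matstruct otherwise)
     and run the one shared SpMV code path on it; omitted here */
  PetscFunctionReturn(0);
}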


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Smith, Barry F. via petsc-dev


  My concern is

1) is it actually optimally efficient for all cases? This kind of stuff, IMHO

if (yy) {
  if (dptr != zarray) {
ierr = VecCopy_SeqCUDA(yy,zz);CHKERRQ(ierr);
  } else if (zz != yy) {
ierr = VecAXPY_SeqCUDA(zz,1.0,yy);CHKERRQ(ierr);
  }
} else if (dptr != zarray) {
  ierr = VecSet_SeqCUDA(zz,0);CHKERRQ(ierr);
}

means it is not. It is launching additional kernels and looping over arrays 
more times than if each form was optimized for its one case.

2) is it utilizing VecCUDAGetArrayWrite() when possible? No, it uses 
VecCUDAGetArray(), which for certain configurations means copying from the CPU stuff 
that will immediately be overwritten. Sometimes it can use 
VecCUDAGetArrayWrite(), sometimes it can't; the code has to handle each case properly.

3) Is comparison between pointers appropriate? For example if (dptr != zarray) 
{ is scary if some arrays are zero length how do we know what the pointer value 
will be?


  I am not saying it is totally impossible to have a single routine that 
optimally and efficiently does all cases: MatMult, yy == zz, etc., but the resulting 
code will be really complex, with lots of if()s, and difficult to understand and 
maintain; just tracing through all the cases and ensuring each is optimal is 
nontrivial.
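
For contrast, a minimal sketch of what a specialized multiply-only path can
look like (the SpMV launch is a placeholder, not the actual PETSc code): one
kernel, output taken with VecCUDAGetArrayWrite(), and no trailing
VecSet/VecCopy/VecAXPY.

static PetscErrorCode MatMultOnly_Sketch(Mat A,Vec xx,Vec zz)
{
  PetscErrorCode     ierr;
  const PetscScalar *xarray;
  PetscScalar       *zarray;

  PetscFunctionBegin;
  ierr = VecCUDAGetArrayRead(xx,&xarray);CHKERRQ(ierr);   /* read-only access to the input */
  ierr = VecCUDAGetArrayWrite(zz,&zarray);CHKERRQ(ierr);  /* write-only: no copy of zz to the GPU */
  /* SpMV_Launch(A,xarray,zarray);  zz = A*xx in a single kernel (placeholder) */
  ierr = VecCUDARestoreArrayRead(xx,&xarray);CHKERRQ(ierr);
  ierr = VecCUDARestoreArrayWrite(zz,&zarray);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}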

   Barry

> On Jul 10, 2019, at 11:01 AM, Stefano Zampini  
> wrote:
> 
> Barry,
> 
> I think having a single code instead of three different, quasi similar, 
> versions is less fragile ( I admit, once you get the logic correct...)
> Also, it conforms with the standard for spmv that implements alpha * A * x + 
> beta * b
> The easiest fix is the following: 
> 
> Rename MatMultAdd_ into MatMultKernel_Private and add an extra boolean to 
> control the transpose operation
> then, you can reuse the same complicated code I have wrote, just by selecting 
> the proper cusparse object (matstructT or matstruct)
> 
> 
> On Wed, Jul 10, 2019 at 18:16 Smith, Barry F.  
> wrote:
> 
>In the long run I would like to see smaller specialized chunks of code 
> (with a bit of duplication between them) instead of highly overloaded 
> routines like MatMultAdd_AIJCUSPARSE. Better 3 routines, for multiple alone, 
> for multiple add alone and for multiple add with sparse format. Trying to get 
> all the cases right (performance and correctness for the everything at once 
> is unnecessary and risky). Having possible zero size objects  (and hence null 
> pointers) doesn't help the complex logic
> 
> 
>Barry
> 
> 
> > On Jul 10, 2019, at 10:06 AM, Mark Adams  wrote:
> > 
> > Thanks, you made several changes here, including switches with the 
> > workvector size. I guess I should import this logic to the transpose 
> > method(s), except for the yy==NULL branches ...
> > 
> > MatMult_ calls MatMultAdd with yy=0, but the transpose version have their 
> > own code. MatMultTranspose_SeqAIJCUSPARSE is very simple. 
> > 
> > Thanks again,
> > Mark
> > 
> > On Wed, Jul 10, 2019 at 9:22 AM Stefano Zampini  
> > wrote:
> > Mark,
> > 
> > if the difference is on lvec, I suspect the bug has to do with compressed 
> > row storage. I have fixed a similar bug in MatMult.
> > you want to check cusparsestruct->workVector->size() against A->cmap->n.
> > 
> > Stefano 
> > 
> > On Wed, Jul 10, 2019 at 15:54 Mark Adams via petsc-dev 
> >  wrote:
> > 
> > 
> > On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F.  wrote:
> > 
> >   ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
> >   if (nt != A->rmap->n) 
> > SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A 
> > (%D) and xx (%D)",A->rmap->n,nt);
> >   ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
> >   ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
> > 
> > So the xx on the GPU appears ok?
> > 
> > The norm is correct and ...
> >  
> > The a->B appears ok?
> > 
> > yes
> >  
> > But on process 1 the result a->lvec is wrong? 
> > 
> > yes
> > 
> > 
> > How do you look at the a->lvec? Do you copy it to the CPU and print it?
> > 
> > I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented so I 
> > should copy it. Maybe I should make a CUDA version of these methods?
> >  
> > 
> >   ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
> >   ierr = 
> > VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
> >   ierr = 
> > VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
> >   ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
> > 
> > Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help? 
> > 
> > This is where I have been digging around an printing stuff.
> >  
> > 
> > Are you sure the problem isn't related to the "stream business"? 
> > 
> > I don't know what that is but I have played around with adding 
> > cudaDeviceSynchronize
> >  
> > 
> > /* This multiplication sequence is different sequence
> >  than the CPU version. In particular, the 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Mark Adams via petsc-dev
Yea, I agree. Once this is working, I'll go back and split MatMultAdd, etc.

On Wed, Jul 10, 2019 at 11:16 AM Smith, Barry F.  wrote:

>
>In the long run I would like to see smaller specialized chunks of code
> (with a bit of duplication between them) instead of highly overloaded
> routines like MatMultAdd_AIJCUSPARSE. Better 3 routines, for multiple
> alone, for multiple add alone and for multiple add with sparse format.
> Trying to get all the cases right (performance and correctness for the
> everything at once is unnecessary and risky). Having possible zero size
> objects  (and hence null pointers) doesn't help the complex logic
>
>
>Barry
>
>
> > On Jul 10, 2019, at 10:06 AM, Mark Adams  wrote:
> >
> > Thanks, you made several changes here, including switches with the
> workvector size. I guess I should import this logic to the transpose
> method(s), except for the yy==NULL branches ...
> >
> > MatMult_ calls MatMultAdd with yy=0, but the transpose version have
> their own code. MatMultTranspose_SeqAIJCUSPARSE is very simple.
> >
> > Thanks again,
> > Mark
> >
> > On Wed, Jul 10, 2019 at 9:22 AM Stefano Zampini <
> stefano.zamp...@gmail.com> wrote:
> > Mark,
> >
> > if the difference is on lvec, I suspect the bug has to do with
> compressed row storage. I have fixed a similar bug in MatMult.
> > you want to check cusparsestruct->workVector->size() against A->cmap->n.
> >
> > Stefano
> >
> > On Wed, Jul 10, 2019 at 15:54 Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> >
> >
> > On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F. 
> wrote:
> >
> >   ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
> >   if (nt != A->rmap->n)
> SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A
> (%D) and xx (%D)",A->rmap->n,nt);
> >   ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
> >   ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
> >
> > So the xx on the GPU appears ok?
> >
> > The norm is correct and ...
> >
> > The a->B appears ok?
> >
> > yes
> >
> > But on process 1 the result a->lvec is wrong?
> >
> > yes
> >
> >
> > How do you look at the a->lvec? Do you copy it to the CPU and print it?
> >
> > I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented so I
> should copy it. Maybe I should make a CUDA version of these methods?
> >
> >
> >   ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
> >   ierr =
> VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
> >   ierr =
> VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
> >   ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
> >
> > Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help?
> >
> > This is where I have been digging around an printing stuff.
> >
> >
> > Are you sure the problem isn't related to the "stream business"?
> >
> > I don't know what that is but I have played around with adding
> cudaDeviceSynchronize
> >
> >
> > /* This multiplication sequence is different sequence
> >  than the CPU version. In particular, the diagonal block
> >  multiplication kernel is launched in one stream. Then,
> >  in a separate stream, the data transfers from DeviceToHost
> >  (with MPI messaging in between), then HostToDevice are
> >  launched. Once the data transfer stream is synchronized,
> >  to ensure messaging is complete, the MatMultAdd kernel
> >  is launched in the original (MatMult) stream to protect
> >  against race conditions.
> >
> >  This sequence should only be called for GPU computation. */
> >
> > Note this comment isn't right and appears to be cut and paste from
> somewhere else, since there is no MatMult() nor MatMultAdd kernel here?
> >
> > Yes, I noticed this. Same as MatMult and not correct here.
> >
> >
> > Anyway to "turn off the stream business" and see if the result is then
> correct?
> >
> > How do you do that? I'm looking at docs on streams but not sure how its
> used here.
> >
> > Perhaps the stream business was done correctly for MatMult() but was
> never right for MatMultTranspose()?
> >
> > Barry
> >
> > BTW: Unrelated comment, the code
> >
> >   ierr = VecSet(yy,0);CHKERRQ(ierr);
> >   ierr = VecCUDAGetArrayWrite(yy,);CHKERRQ(ierr);
> >
> > has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here.
> VecCUDAGetArrayWrite() requires that you ignore the values in yy and set
> them all yourself so setting them to zero before calling
> VecCUDAGetArrayWrite() does nothing except waste time.
> >
> >
> > OK, I'll get rid of it.
> >
> >
> > > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> > >
> > > I am stumped with this GPU bug(s). Maybe someone has an idea.
> > >
> > > I did find a bug in the cuda transpose mat-vec that cuda-memcheck
> detected, but I still have differences between the GPU and CPU transpose
> mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh
> with 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Smith, Barry F. via petsc-dev


   In the long run I would like to see smaller specialized chunks of code (with 
a bit of duplication between them) instead of highly overloaded routines like 
MatMultAdd_AIJCUSPARSE. Better three routines: one for multiply alone, one for 
multiply-add alone, and one for multiply-add with the sparse format. Trying to get 
all the cases right (performance and correctness) for everything at once is 
unnecessary and risky. Having possible zero-size objects (and hence null pointers) 
doesn't help the complex logic.


   Barry


> On Jul 10, 2019, at 10:06 AM, Mark Adams  wrote:
> 
> Thanks, you made several changes here, including switches with the workvector 
> size. I guess I should import this logic to the transpose method(s), except 
> for the yy==NULL branches ...
> 
> MatMult_ calls MatMultAdd with yy=0, but the transpose version have their own 
> code. MatMultTranspose_SeqAIJCUSPARSE is very simple. 
> 
> Thanks again,
> Mark
> 
> On Wed, Jul 10, 2019 at 9:22 AM Stefano Zampini  
> wrote:
> Mark,
> 
> if the difference is on lvec, I suspect the bug has to do with compressed row 
> storage. I have fixed a similar bug in MatMult.
> you want to check cusparsestruct->workVector->size() against A->cmap->n.
> 
> Stefano 
> 
> On Wed, Jul 10, 2019 at 15:54 Mark Adams via petsc-dev 
>  wrote:
> 
> 
> On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F.  wrote:
> 
>   ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
>   if (nt != A->rmap->n) 
> SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A (%D) 
> and xx (%D)",A->rmap->n,nt);
>   ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
>   ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
> 
> So the xx on the GPU appears ok?
> 
> The norm is correct and ...
>  
> The a->B appears ok?
> 
> yes
>  
> But on process 1 the result a->lvec is wrong? 
> 
> yes
> 
> 
> How do you look at the a->lvec? Do you copy it to the CPU and print it?
> 
> I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented so I should 
> copy it. Maybe I should make a CUDA version of these methods?
>  
> 
>   ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
>   ierr = 
> VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>   ierr = 
> VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>   ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
> 
> Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help? 
> 
> This is where I have been digging around an printing stuff.
>  
> 
> Are you sure the problem isn't related to the "stream business"? 
> 
> I don't know what that is but I have played around with adding 
> cudaDeviceSynchronize
>  
> 
> /* This multiplication sequence is different sequence
>  than the CPU version. In particular, the diagonal block
>  multiplication kernel is launched in one stream. Then,
>  in a separate stream, the data transfers from DeviceToHost
>  (with MPI messaging in between), then HostToDevice are
>  launched. Once the data transfer stream is synchronized,
>  to ensure messaging is complete, the MatMultAdd kernel
>  is launched in the original (MatMult) stream to protect
>  against race conditions.
> 
>  This sequence should only be called for GPU computation. */
> 
> Note this comment isn't right and appears to be cut and paste from somewhere 
> else, since there is no MatMult() nor MatMultAdd kernel here?
> 
> Yes, I noticed this. Same as MatMult and not correct here.
>  
> 
> Anyway to "turn off the stream business" and see if the result is then 
> correct? 
> 
> How do you do that? I'm looking at docs on streams but not sure how its used 
> here.
>  
> Perhaps the stream business was done correctly for MatMult() but was never 
> right for MatMultTranspose()?
> 
> Barry
> 
> BTW: Unrelated comment, the code
> 
>   ierr = VecSet(yy,0);CHKERRQ(ierr);
>   ierr = VecCUDAGetArrayWrite(yy,);CHKERRQ(ierr);
> 
> has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here. 
> VecCUDAGetArrayWrite() requires that you ignore the values in yy and set them 
> all yourself so setting them to zero before calling VecCUDAGetArrayWrite() 
> does nothing except waste time.
> 
> 
> OK, I'll get rid of it.
>  
> 
> > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev 
> >  wrote:
> > 
> > I am stumped with this GPU bug(s). Maybe someone has an idea.
> > 
> > I did find a bug in the cuda transpose mat-vec that cuda-memcheck detected, 
> > but I still have differences between the GPU and CPU transpose mat-vec. 
> > I've got it down to a very simple test: bicg/none on a tiny mesh with two 
> > processors. It works on one processor or with cg/none. So it is the 
> > transpose mat-vec.
> > 
> > I see that the result of the off-diagonal  (a->lvec) is different only proc 
> > 1. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of mat and 
> > vec and printed out matlab vectors. Below is the CPU output and then the 
> 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Mark Adams via petsc-dev
Thanks, you made several changes here, including switches with the
workvector size. I guess I should import this logic to the transpose
method(s), except for the yy==NULL branches ...

MatMult_ calls MatMultAdd with yy=0, but the transpose versions have their
own code. MatMultTranspose_SeqAIJCUSPARSE is very simple.

Thanks again,
Mark

On Wed, Jul 10, 2019 at 9:22 AM Stefano Zampini 
wrote:

> Mark,
>
> if the difference is on lvec, I suspect the bug has to do with compressed
> row storage. I have fixed a similar bug in MatMult.
> you want to check cusparsestruct->workVector->size() against A->cmap->n.
>
> Stefano
>
> On Wed, Jul 10, 2019 at 15:54 Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
>
>>
>>
>> On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F. 
>> wrote:
>>
>>>
>>>   ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
>>>   if (nt != A->rmap->n)
>>> SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A
>>> (%D) and xx (%D)",A->rmap->n,nt);
>>>   ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
>>>   ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
>>>
>>> So the xx on the GPU appears ok?
>>
>>
>> The norm is correct and ...
>>
>>
>>> The a->B appears ok?
>>
>>
>> yes
>>
>>
>>> But on process 1 the result a->lvec is wrong?
>>>
>>
>> yes
>>
>>
>>> How do you look at the a->lvec? Do you copy it to the CPU and print it?
>>>
>>
>> I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented so I
>> should copy it. Maybe I should make a CUDA version of these methods?
>>
>>
>>>
>>>   ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
>>>   ierr =
>>> VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>>>   ierr =
>>> VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>>>   ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
>>>
>>> Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help?
>>
>>
>> This is where I have been digging around an printing stuff.
>>
>>
>>>
>>> Are you sure the problem isn't related to the "stream business"?
>>>
>>
>> I don't know what that is but I have played around with adding
>> cudaDeviceSynchronize
>>
>>
>>>
>>> /* This multiplication sequence is different sequence
>>>  than the CPU version. In particular, the diagonal block
>>>  multiplication kernel is launched in one stream. Then,
>>>  in a separate stream, the data transfers from DeviceToHost
>>>  (with MPI messaging in between), then HostToDevice are
>>>  launched. Once the data transfer stream is synchronized,
>>>  to ensure messaging is complete, the MatMultAdd kernel
>>>  is launched in the original (MatMult) stream to protect
>>>  against race conditions.
>>>
>>>  This sequence should only be called for GPU computation. */
>>>
>>> Note this comment isn't right and appears to be cut and paste from
>>> somewhere else, since there is no MatMult() nor MatMultAdd kernel here?
>>>
>>
>> Yes, I noticed this. Same as MatMult and not correct here.
>>
>>
>>>
>>> Anyway to "turn off the stream business" and see if the result is then
>>> correct?
>>
>>
>> How do you do that? I'm looking at docs on streams but not sure how its
>> used here.
>>
>>
>>> Perhaps the stream business was done correctly for MatMult() but was
>>> never right for MatMultTranspose()?
>>>
>>> Barry
>>>
>>> BTW: Unrelated comment, the code
>>>
>>>   ierr = VecSet(yy,0);CHKERRQ(ierr);
>>>   ierr = VecCUDAGetArrayWrite(yy,);CHKERRQ(ierr);
>>>
>>> has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here.
>>> VecCUDAGetArrayWrite() requires that you ignore the values in yy and set
>>> them all yourself so setting them to zero before calling
>>> VecCUDAGetArrayWrite() does nothing except waste time.
>>>
>>>
>> OK, I'll get rid of it.
>>
>>
>>>
>>> > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev <
>>> petsc-dev@mcs.anl.gov> wrote:
>>> >
>>> > I am stumped with this GPU bug(s). Maybe someone has an idea.
>>> >
>>> > I did find a bug in the cuda transpose mat-vec that cuda-memcheck
>>> detected, but I still have differences between the GPU and CPU transpose
>>> mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh
>>> with two processors. It works on one processor or with cg/none. So it is
>>> the transpose mat-vec.
>>> >
>>> > I see that the result of the off-diagonal  (a->lvec) is different only
>>> proc 1. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of mat
>>> and vec and printed out matlab vectors. Below is the CPU output and then
>>> the GPU with a view of the scatter object, which is identical as you can
>>> see.
>>> >
>>> > The matlab B matrix and xx vector are identical. Maybe the GPU copy is
>>> wrong ...
>>> >
>>> > The only/first difference between CPU and GPU is a->lvec (the off
>>> diagonal contribution)on processor 1. (you can see the norms are
>>> different). Here is the diff on the process 1 a->lvec vector (all values
>>> 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Mark Adams via petsc-dev
On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F.  wrote:

>
>   ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
>   if (nt != A->rmap->n)
> SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A
> (%D) and xx (%D)",A->rmap->n,nt);
>   ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
>   ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
>
> So the xx on the GPU appears ok?


The norm is correct and ...


> The a->B appears ok?


yes


> But on process 1 the result a->lvec is wrong?
>

yes


> How do you look at the a->lvec? Do you copy it to the CPU and print it?
>

I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented so I
should copy it. Maybe I should make a CUDA version of these methods?
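
One way to do the copy without new CUDA code (a sketch; the helper name is
made up) is to go through VecGetArrayRead(), which synchronizes a cuda Vec
back to the host before giving out the pointer:

static PetscErrorCode VecDumpToHost_Sketch(Vec v,const char *label)
{
  PetscErrorCode     ierr;
  const PetscScalar *a;
  PetscInt           i,n;

  PetscFunctionBegin;
  ierr = VecGetLocalSize(v,&n);CHKERRQ(ierr);
  ierr = VecGetArrayRead(v,&a);CHKERRQ(ierr);    /* triggers the GPU-to-CPU sync for cuda vecs */
  for (i=0; i<n; i++) {
    ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD,"%s[%D] %18.16e\n",label,i,(double)PetscRealPart(a[i]));CHKERRQ(ierr);
  }
  ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD,PETSC_STDOUT);CHKERRQ(ierr);
  ierr = VecRestoreArrayRead(v,&a);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}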


>
>   ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
>   ierr =
> VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>   ierr =
> VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>   ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
>
> Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help?


This is where I have been digging around and printing stuff.


>
> Are you sure the problem isn't related to the "stream business"?
>

I don't know what that is but I have played around with adding
cudaDeviceSynchronize


>
> /* This multiplication sequence is different sequence
>  than the CPU version. In particular, the diagonal block
>  multiplication kernel is launched in one stream. Then,
>  in a separate stream, the data transfers from DeviceToHost
>  (with MPI messaging in between), then HostToDevice are
>  launched. Once the data transfer stream is synchronized,
>  to ensure messaging is complete, the MatMultAdd kernel
>  is launched in the original (MatMult) stream to protect
>  against race conditions.
>
>  This sequence should only be called for GPU computation. */
>
> Note this comment isn't right and appears to be cut and paste from
> somewhere else, since there is no MatMult() nor MatMultAdd kernel here?
>

Yes, I noticed this. Same as MatMult and not correct here.


>
> Anyway to "turn off the stream business" and see if the result is then
> correct?


How do you do that? I'm looking at the docs on streams but am not sure how it's
used here.


> Perhaps the stream business was done correctly for MatMult() but was never
> right for MatMultTranspose()?
>
> Barry
>
> BTW: Unrelated comment, the code
>
>   ierr = VecSet(yy,0);CHKERRQ(ierr);
>   ierr = VecCUDAGetArrayWrite(yy,);CHKERRQ(ierr);
>
> has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here.
> VecCUDAGetArrayWrite() requires that you ignore the values in yy and set
> them all yourself so setting them to zero before calling
> VecCUDAGetArrayWrite() does nothing except waste time.
>
>
OK, I'll get rid of it.


>
> > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> >
> > I am stumped with this GPU bug(s). Maybe someone has an idea.
> >
> > I did find a bug in the cuda transpose mat-vec that cuda-memcheck
> detected, but I still have differences between the GPU and CPU transpose
> mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh
> with two processors. It works on one processor or with cg/none. So it is
> the transpose mat-vec.
> >
> > I see that the result of the off-diagonal  (a->lvec) is different only
> proc 1. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of mat
> and vec and printed out matlab vectors. Below is the CPU output and then
> the GPU with a view of the scatter object, which is identical as you can
> see.
> >
> > The matlab B matrix and xx vector are identical. Maybe the GPU copy is
> wrong ...
> >
> > The only/first difference between CPU and GPU is a->lvec (the off
> diagonal contribution)on processor 1. (you can see the norms are
> different). Here is the diff on the process 1 a->lvec vector (all values
> are off).
> >
> > Any thoughts would be appreciated,
> > Mark
> >
> > 15:30 1  /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
> > 2,12c2,12
> > < %  type: seqcuda
> > < Vec_0x53738630_0 = [
> > < 9.5702137431412879e+00
> > < 2.1970298791152253e+01
> > < 4.5422290209190646e+00
> > < 2.0185031807270226e+00
> > < 4.2627312508573375e+01
> > < 1.0889191983882025e+01
> > < 1.6038202417695462e+01
> > < 2.7155672033607665e+01
> > < 6.2540357853223556e+00
> > ---
> > > %  type: seq
> > > Vec_0x3a546440_0 = [
> > > 4.5565851251714653e+00
> > > 1.0460532998971189e+01
> > > 2.1626531807270220e+00
> > > 9.6105288923182408e-01
> > > 2.0295782656035659e+01
> > > 5.1845791066529463e+00
> > > 7.6361340020576058e+00
> > > 1.2929401011659799e+01
> > > 2.9776812928669392e+00
> >
> > 15:15 130  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1
> ./ex56 -cells 2,2,1
> > [0] 27 global equations, 9 vertices
> > [0] 27 equations in vector, 9 vertices
> >   0 SNES Function norm 1.223958326481e+02
> > 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-09 Thread Smith, Barry F. via petsc-dev


  ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
  if (nt != A->rmap->n) 
SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A (%D) 
and xx (%D)",A->rmap->n,nt);
  ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
  ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);

So the xx on the GPU appears ok? The a->B appears ok? But on process 1 the 
result a->lvec is wrong? 

How do you look at the a->lvec? Do you copy it to the CPU and print it?

  ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = 
VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
  ierr = 
VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
  ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);

Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help? 

Are you sure the problem isn't related to the "stream business"? 

/* This multiplication sequence is different sequence
 than the CPU version. In particular, the diagonal block
 multiplication kernel is launched in one stream. Then,
 in a separate stream, the data transfers from DeviceToHost
 (with MPI messaging in between), then HostToDevice are
 launched. Once the data transfer stream is synchronized,
 to ensure messaging is complete, the MatMultAdd kernel
 is launched in the original (MatMult) stream to protect
 against race conditions.

 This sequence should only be called for GPU computation. */

Note this comment isn't right and appears to have been cut and pasted from somewhere 
else, since there is no MatMult() nor MatMultAdd kernel here?

Any way to "turn off the stream business" and see if the result is then correct? 
Perhaps the stream business was done correctly for MatMult() but was never 
right for MatMultTranspose()?

Barry

BTW: Unrelated comment, the code

  ierr = VecSet(yy,0);CHKERRQ(ierr);
  ierr = VecCUDAGetArrayWrite(yy,);CHKERRQ(ierr);

has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here. VecCUDAGetArrayWrite() 
requires that you ignore the values in yy and set them all yourself so setting 
them to zero before calling VecCUDAGetArrayWrite() does nothing except waste 
time.
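
In other words, the pattern can simply be (a sketch; the array name is
illustrative, not the actual variable in the routine):

  PetscScalar *yarray;
  ierr = VecCUDAGetArrayWrite(yy,&yarray);CHKERRQ(ierr);   /* no VecSet(yy,0) beforehand */
  /* ... the kernel writes every entry of yarray ... */
  ierr = VecCUDARestoreArrayWrite(yy,&yarray);CHKERRQ(ierr);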


> On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev  
> wrote:
> 
> I am stumped with this GPU bug(s). Maybe someone has an idea.
> 
> I did find a bug in the cuda transpose mat-vec that cuda-memcheck detected, 
> but I still have differences between the GPU and CPU transpose mat-vec. I've 
> got it down to a very simple test: bicg/none on a tiny mesh with two 
> processors. It works on one processor or with cg/none. So it is the transpose 
> mat-vec.
> 
> I see that the result of the off-diagonal  (a->lvec) is different only proc 
> 1. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of mat and vec 
> and printed out matlab vectors. Below is the CPU output and then the GPU with 
> a view of the scatter object, which is identical as you can see.
> 
> The matlab B matrix and xx vector are identical. Maybe the GPU copy is wrong 
> ...
> 
> The only/first difference between CPU and GPU is a->lvec (the off diagonal 
> contribution)on processor 1. (you can see the norms are different). Here is 
> the diff on the process 1 a->lvec vector (all values are off).
> 
> Any thoughts would be appreciated,
> Mark
> 
> 15:30 1  /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
> 2,12c2,12
> < %  type: seqcuda
> < Vec_0x53738630_0 = [
> < 9.5702137431412879e+00
> < 2.1970298791152253e+01
> < 4.5422290209190646e+00
> < 2.0185031807270226e+00
> < 4.2627312508573375e+01
> < 1.0889191983882025e+01
> < 1.6038202417695462e+01
> < 2.7155672033607665e+01
> < 6.2540357853223556e+00
> ---
> > %  type: seq
> > Vec_0x3a546440_0 = [
> > 4.5565851251714653e+00
> > 1.0460532998971189e+01
> > 2.1626531807270220e+00
> > 9.6105288923182408e-01
> > 2.0295782656035659e+01
> > 5.1845791066529463e+00
> > 7.6361340020576058e+00
> > 1.2929401011659799e+01
> > 2.9776812928669392e+00
> 
> 15:15 130  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 
> ./ex56 -cells 2,2,1 
> [0] 27 global equations, 9 vertices
> [0] 27 equations in vector, 9 vertices
>   0 SNES Function norm 1.223958326481e+02 
> 0 KSP Residual norm 1.223958326481e+02 
> [0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=  
> 1.424708937136e+00
> [1] |x|=  1.223958326481e+02 |a->lvec|=  2.844171413778e+01 |B|=  
> 1.424708937136e+00
> [1] 1) |yy|=  2.007423334680e+02
> [0] 1) |yy|=  2.007423334680e+02
> [0] 2) |yy|=  1.957605719265e+02
> [1] 2) |yy|=  1.957605719265e+02
> [1] Number sends = 1; Number to self = 0
> [1]   0 length = 9 to whom 0
> Now the indices for all remote sends (in order by process sent to)
> [1] 9 
> [1] 10 
> [1] 11 
> [1] 12 
> [1] 13 
> [1] 14 
> [1] 15 
> [1] 16 
> [1] 17 
> [1] Number receives = 1; Number from self = 0
> [1] 0 length 9 from whom 0
> Now the indices for all remote receives (in order by process received from)
> [1] 0 
> [1] 1 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-09 Thread Mark Adams via petsc-dev
I am stumped with this GPU bug(s). Maybe someone has an idea.

I did find a bug in the cuda transpose mat-vec that cuda-memcheck detected,
but I still have differences between the GPU and CPU transpose mat-vec.
I've got it down to a very simple test: bicg/none on a tiny mesh with two
processors. It works on one processor or with cg/none. So it is the
transpose mat-vec.

I see that the result of the off-diagonal (a->lvec) is different *only on
proc 1*. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of mat
and vec and printed out matlab vectors. Below is the CPU output and then
the GPU with a view of the scatter object, which is identical as you can
see.

The matlab B matrix and xx vector are identical. Maybe the GPU copy
is wrong ...

The only/first difference between CPU and GPU is a->lvec (the off-diagonal
contribution) on processor 1 (you can see the norms are *different*). Here
is the diff on the process 1 a->lvec vector (all values are off).

Any thoughts would be appreciated,
Mark

15:30 1  /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
2,12c2,12
< %  type: seqcuda
< Vec_0x53738630_0 = [
< 9.5702137431412879e+00
< 2.1970298791152253e+01
< 4.5422290209190646e+00
< 2.0185031807270226e+00
< 4.2627312508573375e+01
< 1.0889191983882025e+01
< 1.6038202417695462e+01
< 2.7155672033607665e+01
< 6.2540357853223556e+00
---
> %  type: seq
> Vec_0x3a546440_0 = [
> 4.5565851251714653e+00
> 1.0460532998971189e+01
> 2.1626531807270220e+00
> 9.6105288923182408e-01
> 2.0295782656035659e+01
> 5.1845791066529463e+00
> 7.6361340020576058e+00
> 1.2929401011659799e+01
> 2.9776812928669392e+00

15:15 130  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1
./ex56 -cells 2,2,1
[0] 27 global equations, 9 vertices
[0] 27 equations in vector, 9 vertices
  0 SNES Function norm 1.223958326481e+02
0 KSP Residual norm 1.223958326481e+02
[0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=
 1.424708937136e+00
[1] |x|=  1.223958326481e+02 |a->lvec|=  *2.844171413778e+01* |B|=
 1.424708937136e+00
[1] 1) |yy|=  2.007423334680e+02
[0] 1) |yy|=  2.007423334680e+02
[0] 2) |yy|=  1.957605719265e+02
[1] 2) |yy|=  1.957605719265e+02
[1] Number sends = 1; Number to self = 0
[1]   0 length = 9 to whom 0
Now the indices for all remote sends (in order by process sent to)
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] Number receives = 1; Number from self = 0
[1] 0 length 9 from whom 0
Now the indices for all remote receives (in order by process received from)
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
1 KSP Residual norm 8.199932342150e+01
  Linear solve did not converge due to DIVERGED_ITS iterations 1
Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0


15:19  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 ./ex56
-cells 2,2,1 *-ex56_dm_mat_type aijcusparse -ex56_dm_vec_type cuda*
[0] 27 global equations, 9 vertices
[0] 27 equations in vector, 9 vertices
  0 SNES Function norm 1.223958326481e+02
0 KSP Residual norm 1.223958326481e+02
[0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=
 1.424708937136e+00
[1] |x|=  1.223958326481e+02 |a->lvec|=  *5.973624458725e+01* |B|=
 1.424708937136e+00
[0] 1) |yy|=  2.007423334680e+02
[1] 1) |yy|=  2.007423334680e+02
[0] 2) |yy|=  1.953571867633e+02
[1] 2) |yy|=  1.953571867633e+02
[1] Number sends = 1; Number to self = 0
[1]   0 length = 9 to whom 0
Now the indices for all remote sends (in order by process sent to)
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] Number receives = 1; Number from self = 0
[1] 0 length 9 from whom 0
Now the indices for all remote receives (in order by process received from)
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
1 KSP Residual norm 8.199932342150e+01