Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
git branch --contains barry/2019-09-01/robustify-version-check
  balay/jed-gitlab-ci
  master

Make a new branch from your current branch, add something like -feature-sf-on-gpu to the end of the name, merge in jczhang/feature-sf-on-gpu, and configure and test with that.

   Barry

> On Sep 1, 2019, at 9:50 AM, Mark Adams wrote:
>
> Junchao and Barry,
>
> I am using mark/fix-cuda-with-gamg-pintocpu, which is built on Barry's robustify branch. Is this in master yet? If so, I'd like to get my branch merged to master, then merge Junchao's branch, then use it.
>
> I think we were waiting for some refactoring from Karl to proceed. Anyway, I'm not sure how to proceed.
>
> Thanks,
> Mark
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Junchao and Barry,

I am using mark/fix-cuda-with-gamg-pintocpu, which is built on Barry's robustify branch. Is this in master yet? If so, I'd like to get my branch merged to master, then merge Junchao's branch, then use it.

I think we were waiting for some refactoring from Karl to proceed. Anyway, I'm not sure how to proceed.

Thanks,
Mark

On Sun, Sep 1, 2019 at 8:45 AM Zhang, Junchao wrote:
>
> Use jsrun --smpiargs="-gpu" to enable IBM MPI's CUDA-aware support, then add -use_gpu_aware_mpi as an option to let PETSc use that feature.
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
On Sat, Aug 31, 2019 at 8:04 PM Mark Adams wrote:
>
> Sure, do I just checkout jczhang/feature-sf-on-gpu and run as usual?

Use jsrun --smpiargs="-gpu" to enable IBM MPI's CUDA-aware support, then add -use_gpu_aware_mpi as an option to let PETSc use that feature.
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F. wrote:
>
> Any explanation for why the scaling is much better for CPUs than for GPUs? Is it the "extra" time needed for communication from the GPUs?

The GPU work is well load balanced, so it weak scales perfectly. When you put that work on the CPU you add more perfectly scalable work, so the scaling looks better. For instance, the 98K dof/proc data goes up by about 1/2 second from the 1-node to the 512-node case for both GPU and CPU, because this non-scaling comes from communication that is the same in both cases.

> Perhaps you could try the GPU version with Junchao's new MPI-aware CUDA branch (in the gitlab merge requests) that can speed up the communication from GPUs?

Sure, do I just checkout jczhang/feature-sf-on-gpu and run as usual?
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Any explanation for why the scaling is much better for CPUs than for GPUs? Is it the "extra" time needed for communication from the GPUs?

Perhaps you could try the GPU version with Junchao's new MPI-aware CUDA branch (in the gitlab merge requests) that can speed up the communication from GPUs?

   Barry

> On Aug 30, 2019, at 11:56 AM, Mark Adams wrote:
>
> Here is some more weak scaling data with a fixed number of iterations (I have given a test with the numerical problems to ORNL and they said they would give it to Nvidia).
>
> I implemented an option to "spread" the reduced coarse grids across the whole machine, as opposed to a "compact" layout where active processes are laid out in simple lexicographic order. This spread approach looks a little better.
>
> Mark
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Ahh, PGI compiler, that explains it :-)

Ok, thanks. Don't worry about the runs right now. We'll figure out the fix. The code is just

    *a = (PetscReal)strtod(name,endptr);

could be a compiler bug.

> On Aug 14, 2019, at 9:23 PM, Mark Adams wrote:
>
> I am getting this error with single:
>
> [0]PETSC ERROR: *** unknown floating point error occurred ***
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
I am getting this error with single:

22:21 /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1 ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type aijcusparse -fp_trap
[0] 81 global equations, 27 vertices
[0]PETSC ERROR: *** unknown floating point error occurred ***
[0]PETSC ERROR: The specific exception can be determined by running in a debugger. When the
[0]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3e00)
[0]PETSC ERROR: where the result is a bitwise OR of the following flags:
[0]PETSC ERROR: FE_INVALID=0x2000 FE_DIVBYZERO=0x400 FE_OVERFLOW=0x1000 FE_UNDERFLOW=0x800 FE_INEXACT=0x200
[0]PETSC ERROR: Try option -start_in_debugger
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: --------------------- Stack Frames ---------------------
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR:       INSTEAD the line number of the start of the function is given.
[0]PETSC ERROR: [0] PetscDefaultFPTrap line 355 /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
[0]PETSC ERROR: [0] PetscStrtod line 1964 /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
[0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021 /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
[0]PETSC ERROR: [0] PetscOptionsGetReal line 2321 /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
[0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015 /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
[0]PETSC ERROR: [0] KSPSetFromOptions line 329 /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
[0]PETSC ERROR: [0] SNESSetFromOptions line 869 /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
[0]PETSC ERROR: --------------------- Error Message ---------------------
[0]PETSC ERROR: Floating point exception
[0]PETSC ERROR: trapped floating point error
[0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1685-gd3eb2e1  GIT Date: 2019-08-13 06:33:29 -0400
[0]PETSC ERROR: ./ex56_single on a arch-summit-dbg-single-pgi-cuda named h36n11 by adams Wed Aug 14 22:21:56 2019
[0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpiCC --with-fc=mpif90 COPTFLAGS="-g -Mfcon" CXXOPTFLAGS="-g -Mfcon" FOPTFLAGS="-g -Mfcon" --with-precision=single --with-ssl=0 --with-batch=0 --with-mpiexec="jsrun -g 1" --with-cuda=1 --with-cudac=nvcc CUDAFLAGS="-ccbin pgc++" --download-metis --download-parmetis --download-fblaslapack --with-x=0 --with-64-bit-indices=0 --with-debugging=1 PETSC_ARCH=arch-summit-dbg-single-pgi-cuda
[0]PETSC ERROR: #1 User provided function() line 0 in Unknown file
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Oh, doesn't even have to be that large. We just need to be able to look at the flop rates (as a surrogate for run times) and compare with the previous runs. So long as the size per process is pretty much the same that is good enough. Barry > On Aug 14, 2019, at 8:45 PM, Mark Adams wrote: > > I can run single, I just can't scale up. But I can use like 1500 processors. > > On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F. wrote: > > Oh, are all your integers 8 bytes? Even on one node? > > Once Karl's new middleware is in place we should see about reducing to 4 > bytes on the GPU. > >Barry > > > > On Aug 14, 2019, at 7:44 PM, Mark Adams wrote: > > > > OK, I'll run single. It a bit perverse to run with 4 byte floats and 8 byte > > integers ... I could use 32 bit ints and just not scale out. > > > > On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F. wrote: > > > > Mark, > > > >Oh, I don't even care if it converges, just put in a fixed number of > > iterations. The idea is to just get a baseline of the possible improvement. > > > > ECP is literally dropping millions into research on "multi precision" > > computations on GPUs, we need to have some actual numbers for the best > > potential benefit to determine how much we invest in further investigating > > it, or not. > > > > I am not expressing any opinions on the approach, we are just in the > > fact gathering stage. > > > > > >Barry > > > > > > > On Aug 14, 2019, at 2:27 PM, Mark Adams wrote: > > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. > > > wrote: > > > > > > Mark, > > > > > >Would you be able to make one run using single precision? Just single > > > everywhere since that is all we support currently? > > > > > > > > > Experience in engineering at least is single does not work for FE > > > elasticity. I have tried it many years ago and have heard this from > > > others. This problem is pretty simple other than using Q2. 
I suppose I > > > could try it, but just be aware the FE people might say that single sucks. > > > > > >The results will give us motivation (or anti-motivation) to have > > > support for running KSP (or PC (or Mat) in single precision while the > > > simulation is double. > > > > > >Thanks. > > > > > > Barry > > > > > > For example if the GPU speed on KSP is a factor of 3 over the double on > > > GPUs this is serious motivation. > > > > > > > > > > On Aug 14, 2019, at 12:45 PM, Mark Adams wrote: > > > > > > > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU > > > > speedup with 98K dof/proc (3D Q2 elasticity). > > > > > > > > This is weak scaling of a solve. There is growth in iteration count > > > > folded in here. I should put rtol in the title and/or run a fixed > > > > number of iterations and make it clear in the title. > > > > > > > > Comments welcome. > > > > > > > > > >
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
I can run single, I just can't scale up. But I can use like 1500 processors. On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F. wrote: > > Oh, are all your integers 8 bytes? Even on one node? > > Once Karl's new middleware is in place we should see about reducing to 4 > bytes on the GPU. > >Barry > > > > On Aug 14, 2019, at 7:44 PM, Mark Adams wrote: > > > > OK, I'll run single. It a bit perverse to run with 4 byte floats and 8 > byte integers ... I could use 32 bit ints and just not scale out. > > > > On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F. > wrote: > > > > Mark, > > > >Oh, I don't even care if it converges, just put in a fixed number of > iterations. The idea is to just get a baseline of the possible improvement. > > > > ECP is literally dropping millions into research on "multi > precision" computations on GPUs, we need to have some actual numbers for > the best potential benefit to determine how much we invest in further > investigating it, or not. > > > > I am not expressing any opinions on the approach, we are just in the > fact gathering stage. > > > > > >Barry > > > > > > > On Aug 14, 2019, at 2:27 PM, Mark Adams wrote: > > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. > wrote: > > > > > > Mark, > > > > > >Would you be able to make one run using single precision? Just > single everywhere since that is all we support currently? > > > > > > > > > Experience in engineering at least is single does not work for FE > elasticity. I have tried it many years ago and have heard this from others. > This problem is pretty simple other than using Q2. I suppose I could try > it, but just be aware the FE people might say that single sucks. > > > > > >The results will give us motivation (or anti-motivation) to have > support for running KSP (or PC (or Mat) in single precision while the > simulation is double. > > > > > >Thanks. 
> > > > > > Barry > > > > > > For example if the GPU speed on KSP is a factor of 3 over the double > on GPUs this is serious motivation. > > > > > > > > > > On Aug 14, 2019, at 12:45 PM, Mark Adams wrote: > > > > > > > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x > GPU speedup with 98K dof/proc (3D Q2 elasticity). > > > > > > > > This is weak scaling of a solve. There is growth in iteration count > folded in here. I should put rtol in the title and/or run a fixed number of > iterations and make it clear in the title. > > > > > > > > Comments welcome. > > > > > > > > > > > >
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Oh, are all your integers 8 bytes? Even on one node? Once Karl's new middleware is in place we should see about reducing to 4 bytes on the GPU. Barry > On Aug 14, 2019, at 7:44 PM, Mark Adams wrote: > > OK, I'll run single. It a bit perverse to run with 4 byte floats and 8 byte > integers ... I could use 32 bit ints and just not scale out. > > On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F. wrote: > > Mark, > >Oh, I don't even care if it converges, just put in a fixed number of > iterations. The idea is to just get a baseline of the possible improvement. > > ECP is literally dropping millions into research on "multi precision" > computations on GPUs, we need to have some actual numbers for the best > potential benefit to determine how much we invest in further investigating > it, or not. > > I am not expressing any opinions on the approach, we are just in the fact > gathering stage. > > >Barry > > > > On Aug 14, 2019, at 2:27 PM, Mark Adams wrote: > > > > > > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. wrote: > > > > Mark, > > > >Would you be able to make one run using single precision? Just single > > everywhere since that is all we support currently? > > > > > > Experience in engineering at least is single does not work for FE > > elasticity. I have tried it many years ago and have heard this from others. > > This problem is pretty simple other than using Q2. I suppose I could try > > it, but just be aware the FE people might say that single sucks. > > > >The results will give us motivation (or anti-motivation) to have support > > for running KSP (or PC (or Mat) in single precision while the simulation > > is double. > > > >Thanks. > > > > Barry > > > > For example if the GPU speed on KSP is a factor of 3 over the double on > > GPUs this is serious motivation. > > > > > > > On Aug 14, 2019, at 12:45 PM, Mark Adams wrote: > > > > > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU > > > speedup with 98K dof/proc (3D Q2 elasticity). 
> > > > > > This is weak scaling of a solve. There is growth in iteration count > > > folded in here. I should put rtol in the title and/or run a fixed number > > > of iterations and make it clear in the title. > > > > > > Comments welcome. > > > > > >
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
OK, I'll run single. It's a bit perverse to run with 4 byte floats and 8 byte integers ... I could use 32 bit ints and just not scale out. On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F. wrote: > > Mark, > >Oh, I don't even care if it converges, just put in a fixed number of > iterations. The idea is to just get a baseline of the possible improvement. > > ECP is literally dropping millions into research on "multi precision" > computations on GPUs, we need to have some actual numbers for the best > potential benefit to determine how much we invest in further investigating > it, or not. > > I am not expressing any opinions on the approach, we are just in the > fact gathering stage. > > >Barry > > > > On Aug 14, 2019, at 2:27 PM, Mark Adams wrote: > > > > > > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. > wrote: > > > > Mark, > > > >Would you be able to make one run using single precision? Just single > everywhere since that is all we support currently? > > > > > > Experience in engineering at least is single does not work for FE > elasticity. I have tried it many years ago and have heard this from others. > This problem is pretty simple other than using Q2. I suppose I could try > it, but just be aware the FE people might say that single sucks. > > > >The results will give us motivation (or anti-motivation) to have > support for running KSP (or PC (or Mat) in single precision while the > simulation is double. > > > >Thanks. > > > > Barry > > > > For example if the GPU speed on KSP is a factor of 3 over the double on > GPUs this is serious motivation. > > > > > > > On Aug 14, 2019, at 12:45 PM, Mark Adams wrote: > > > > > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU > speedup with 98K dof/proc (3D Q2 elasticity). > > > > > > This is weak scaling of a solve. There is growth in iteration count > folded in here. I should put rtol in the title and/or run a fixed number of > iterations and make it clear in the title.
> > > > > > Comments welcome. > > > > > > > >
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
FYI, this test has a smooth (polynomial) body force and it runs a convergence study. On Wed, Aug 14, 2019 at 6:15 PM Brad Aagaard via petsc-dev < petsc-dev@mcs.anl.gov> wrote: > Q2 is often useful in problems with body forces (such as gravitational > body forces), which tend to have linear variations in stress. > > On 8/14/19 2:51 PM, Mark Adams via petsc-dev wrote: > > > > > > Do you have any applications that specifically want Q2 (versus Q1) > > elasticity or have some test problems that would benefit? > > > > > > No, I'm just trying to push things. >
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
"Smith, Barry F." writes: >> On Aug 14, 2019, at 5:58 PM, Jed Brown wrote: >> >> "Smith, Barry F." writes: >> On Aug 14, 2019, at 2:37 PM, Jed Brown wrote: Mark Adams via petsc-dev writes: > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. > wrote: > >> >> Mark, >> >> Would you be able to make one run using single precision? Just single >> everywhere since that is all we support currently? >> >> > Experience in engineering at least is single does not work for FE > elasticity. I have tried it many years ago and have heard this from > others. > This problem is pretty simple other than using Q2. I suppose I could try > it, but just be aware the FE people might say that single sucks. When they say that single sucks, is it for the definition of the operator or the preconditioner? As point of reference, we can apply Q2 elasticity operators in double precision at nearly a billion dofs/second per GPU. >>> >>> And in single you get what? >> >> I don't have exact numbers, but <2x faster on V100, and it sort of >> doesn't matter because preconditioning cost will dominate. > >When using block formats a much higher percentage of the bandwidth goes to > moving the double precision matrix entries so switching to single could > conceivably benefit up to almost a factor of two. > > Depending on the matrix structure perhaps the column indices could be > handled by a shift and short j indices. Or 2 shifts and 2 sets of j indices Shorts are a problem, but a lot of matrices are actually quite compressible if you subtract the row from all the column indices. I've done some experiments using zstd and the CPU decode rate is competitive to better than DRAM bandwidth. But that gives up random access, which seems important for vectorization. Maybe someone who knows more about decompression on GPUs can comment? >> The big win >> of single is on consumer-grade GPUs, which DOE doesn't install and >> NVIDIA forbids to be used in data centers (because they're so >> cost-effective ;-)).
> >DOE LCFs are not our only customers. Cheap-o engineering professors >might stack a bunch of consumer grade in their lab, would they >benefit? Satish's basement could hold a great deal of consumer >grades. Fair point. Time is also important so most companies buy the more expensive hardware on the assumption it means less frequent problems (due to lack of ECC, etc.).
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
> On Aug 14, 2019, at 5:58 PM, Jed Brown wrote: > > "Smith, Barry F." writes: > >>> On Aug 14, 2019, at 2:37 PM, Jed Brown wrote: >>> >>> Mark Adams via petsc-dev writes: >>> On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. wrote: > > Mark, > > Would you be able to make one run using single precision? Just single > everywhere since that is all we support currently? > > Experience in engineering at least is single does not work for FE elasticity. I have tried it many years ago and have heard this from others. This problem is pretty simple other than using Q2. I suppose I could try it, but just be aware the FE people might say that single sucks. >>> >>> When they say that single sucks, is it for the definition of the >>> operator or the preconditioner? >>> >>> As point of reference, we can apply Q2 elasticity operators in double >>> precision at nearly a billion dofs/second per GPU. >> >> And in single you get what? > > I don't have exact numbers, but <2x faster on V100, and it sort of > doesn't matter because preconditioning cost will dominate. When using block formats a much higher percentage of the bandwidth goes to moving the double precision matrix entries so switching to single could conceivably benefit up to almost a factor of two. Depending on the matrix structure perhaps the column indices could be handled by a shift and short j indices. Or 2 shifts and 2 sets of j indices > The big win > of single is on consumer-grade GPUs, which DOE doesn't install and > NVIDIA forbids to be used in data centers (because they're so > cost-effective ;-)). DOE LCFs are not our only customers. Cheap-o engineering professors might stack a bunch of consumer grade in their lab, would they benefit? Satish's basement could hold a great deal of consumer grades. > >>> I'm skeptical of big wins in preconditioning (especially setup) due to >>> the cost and irregularity of indexing being large compared to the >>> bandwidth cost of the floating point values.
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
> On Aug 14, 2019, at 3:36 PM, Mark Adams wrote: > > > > On Wed, Aug 14, 2019 at 3:37 PM Jed Brown wrote: > Mark Adams via petsc-dev writes: > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. wrote: > > > >> > >> Mark, > >> > >>Would you be able to make one run using single precision? Just single > >> everywhere since that is all we support currently? > >> > >> > > Experience in engineering at least is single does not work for FE > > elasticity. I have tried it many years ago and have heard this from others. > > This problem is pretty simple other than using Q2. I suppose I could try > > it, but just be aware the FE people might say that single sucks. > > When they say that single sucks, is it for the definition of the > operator or the preconditioner? > > Operator. > > And I've seen GMRES stagnate when using single in communication in parallel > Gauss-Seidel. Roundoff is nonlinear. When specific places in the algorithm require more precision, this can potentially be added. For example compute reductions in double. Even "delicate" parts of the function/Jacobian evaluation. Is it worth the bother? Apparently it is for the people with suitcases of money to hand out. > > > As point of reference, we can apply Q2 elasticity operators in double > precision at nearly a billion dofs/second per GPU. > > I'm skeptical of big wins in preconditioning (especially setup) due to > the cost and irregularity of indexing being large compared to the > bandwidth cost of the floating point values.
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
"Smith, Barry F." writes: >> On Aug 14, 2019, at 2:37 PM, Jed Brown wrote: >> >> Mark Adams via petsc-dev writes: >> >>> On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. wrote: >>> Mark, Would you be able to make one run using single precision? Just single everywhere since that is all we support currently? >>> Experience in engineering at least is single does not work for FE >>> elasticity. I have tried it many years ago and have heard this from others. >>> This problem is pretty simple other than using Q2. I suppose I could try >>> it, but just be aware the FE people might say that single sucks. >> >> When they say that single sucks, is it for the definition of the >> operator or the preconditioner? >> >> As point of reference, we can apply Q2 elasticity operators in double >> precision at nearly a billion dofs/second per GPU. > > And in single you get what? I don't have exact numbers, but <2x faster on V100, and it sort of doesn't matter because preconditioning cost will dominate. The big win of single is on consumer-grade GPUs, which DOE doesn't install and NVIDIA forbids to be used in data centers (because they're so cost-effective ;-)). >> I'm skeptical of big wins in preconditioning (especially setup) due to >> the cost and irregularity of indexing being large compared to the >> bandwidth cost of the floating point values.
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
> On Aug 14, 2019, at 2:37 PM, Jed Brown wrote: > > Mark Adams via petsc-dev writes: > >> On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. wrote: >> >>> >>> Mark, >>> >>> Would you be able to make one run using single precision? Just single >>> everywhere since that is all we support currently? >>> >>> >> Experience in engineering at least is single does not work for FE >> elasticity. I have tried it many years ago and have heard this from others. >> This problem is pretty simple other than using Q2. I suppose I could try >> it, but just be aware the FE people might say that single sucks. > > When they say that single sucks, is it for the definition of the > operator or the preconditioner? > > As point of reference, we can apply Q2 elasticity operators in double > precision at nearly a billion dofs/second per GPU. And in single you get what? > > I'm skeptical of big wins in preconditioning (especially setup) due to > the cost and irregularity of indexing being large compared to the > bandwidth cost of the floating point values.
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Mark, Oh, I don't even care if it converges, just put in a fixed number of iterations. The idea is to just get a baseline of the possible improvement. ECP is literally dropping millions into research on "multi precision" computations on GPUs, we need to have some actual numbers for the best potential benefit to determine how much we invest in further investigating it, or not. I am not expressing any opinions on the approach, we are just in the fact gathering stage. Barry > On Aug 14, 2019, at 2:27 PM, Mark Adams wrote: > > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. wrote: > > Mark, > >Would you be able to make one run using single precision? Just single > everywhere since that is all we support currently? > > > Experience in engineering at least is single does not work for FE elasticity. > I have tried it many years ago and have heard this from others. This problem > is pretty simple other than using Q2. I suppose I could try it, but just be > aware the FE people might say that single sucks. > >The results will give us motivation (or anti-motivation) to have support > for running KSP (or PC (or Mat) in single precision while the simulation is > double. > >Thanks. > > Barry > > For example if the GPU speed on KSP is a factor of 3 over the double on GPUs > this is serious motivation. > > > > On Aug 14, 2019, at 12:45 PM, Mark Adams wrote: > > > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU > > speedup with 98K dof/proc (3D Q2 elasticity). > > > > This is weak scaling of a solve. There is growth in iteration count folded > > in here. I should put rtol in the title and/or run a fixed number of > > iterations and make it clear in the title. > > > > Comments welcome. > > >
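One way to run the fixed-iteration baseline Barry describes using runtime options alone (a sketch; assumes the standard PETSc KSP option names) is to make the tolerances unreachable, so every solve performs exactly -ksp_max_it iterations:

```
-ksp_max_it 20 -ksp_rtol 1.e-50 -ksp_atol 1.e-50
```

-ksp_converged_reason should then report DIVERGED_ITS, which is expected and harmless for this timing experiment.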
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Here are the times for KSPSolve on one node with 2,280,285 equations. These nodes seem to have 42 cores. There are 6 "devices" (GPUs) with 7 cores attached to each device. The anomalous 28-core result could be from only using 4 "devices". I figure I will use 36 cores for now. I should really do this with a lot of processors to include MPI communication...

NP   KSPSolve (sec)
20   5.6634e+00
24   4.7382e+00
28   6.0349e+00
32   4.7543e+00
36   4.2574e+00
42   4.2022e+00
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Brad Aagaard via petsc-dev writes: > Q2 is often useful in problems with body forces (such as gravitational > body forces), which tend to have linear variations in stress. It's similar on the free-surface Stokes side, where pressure has a linear gradient and must be paired with a stable velocity space. Regarding elasticity, it would be useful to collect some application problems where Q2 shows a big advantage. We should be able to solve Q2 at the same or lower cost per dof as Q1 (multigrid for this case isn't off-the-shelf at present, but it's something we're working on). > On 8/14/19 2:51 PM, Mark Adams via petsc-dev wrote: >> >> >> Do you have any applications that specifically want Q2 (versus Q1) >> elasticity or have some test problems that would benefit? >> >> >> No, I'm just trying to push things.
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Q2 is often useful in problems with body forces (such as gravitational body forces), which tend to have linear variations in stress. On 8/14/19 2:51 PM, Mark Adams via petsc-dev wrote: Do you have any applications that specifically want Q2 (versus Q1) elasticity or have some test problems that would benefit? No, I'm just trying to push things.
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
> > > > Do you have any applications that specifically want Q2 (versus Q1) > elasticity or have some test problems that would benefit? > > No, I'm just trying to push things.
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Mark Adams writes: > On Wed, Aug 14, 2019 at 3:37 PM Jed Brown wrote: > >> Mark Adams via petsc-dev writes: >> >> > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. >> wrote: >> > >> >> >> >> Mark, >> >> >> >>Would you be able to make one run using single precision? Just single >> >> everywhere since that is all we support currently? >> >> >> >> >> > Experience in engineering at least is single does not work for FE >> > elasticity. I have tried it many years ago and have heard this from >> others. >> > This problem is pretty simple other than using Q2. I suppose I could try >> > it, but just be aware the FE people might say that single sucks. >> >> When they say that single sucks, is it for the definition of the >> operator or the preconditioner? >> > > Operator. > > And I've seen GMRES stagnate when using single in communication in parallel > Gauss-Seidel. Roundoff is nonlinear. Fair; single may still be useful in the preconditioner while using double for operator and Krylov. Do you have any applications that specifically want Q2 (versus Q1) elasticity or have some test problems that would benefit? >> As point of reference, we can apply Q2 elasticity operators in double >> precision at nearly a billion dofs/second per GPU. > > >> I'm skeptical of big wins in preconditioning (especially setup) due to >> the cost and irregularity of indexing being large compared to the >> bandwidth cost of the floating point values. >>
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
On Wed, Aug 14, 2019 at 2:19 PM Smith, Barry F. wrote: > > Mark, > > This is great, we can study these for months. > > 1) At the top of the plots you say SNES but that can't be right, there is > no way it is getting such speed ups for the entire SNES solve since the > Jacobians are CPUs and take much of the time. Do you mean the KSP part of > the SNES solve? > It uses KSPONLY. And solve times are KSPSolve with KSPSetUp called before. > > 2) For the case of a bit more than 1000 processes the speedup with GPUs is > fantastic, more than 6? > I did not see that one, but it is plausible and there is some noise in this data. The largest solve had a speedup of about 4x. > > 3) People will ask about runs using all 48 CPUs, since they are there it > is a little unfair to only compare 24 CPUs with the GPUs. Presumably due to > memory bandwidth limits 48 won't be much better than 24 but you need it in > your back pocket for completeness. > > Ah, good point. I just cut and paste but I should run a little test and see where it saturates. > 4) From the table > > KSPSolve 1 1.0 5.4191e-02 1.0 9.35e+06 7.3 1.3e+04 5.6e+02 > 8.3e+01 0 0 4 0 3 10 57 97 52 81 19113494114 3.06e-01 129 > 1.38e-01 84 > PCApply 17 1.0 4.5053e-02 1.0 9.22e+06 8.5 1.1e+04 5.6e+02 > 3.4e+01 0 0 3 0 1 8 49 81 44 33 19684007 98 2.58e-01 113 > 1.19e-01 81 > > only 84 percent of the total flops in the KSPSolve are on the GPU and only > 81 for the PCApply() where are the rest? MatMult() etc are doing 100 > percent on the GPU, MatSolve on the coarsest level should be tiny and not > taking 19 percent of the flops? > > That is the smallest test with 3465 equations on 24 cores. the R and P and coarse grid are on the CPU. Look at larger tests. > Thanks > >Barry > > > > On Aug 14, 2019, at 12:45 PM, Mark Adams wrote: > > > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU > speedup with 98K dof/proc (3D Q2 elasticity). > > > > This is weak scaling of a solve. 
There is growth in iteration count > folded in here. I should put rtol in the title and/or run a fixed number of > iterations and make it clear in the title. > > > > Comments welcome. > > > > >
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
On Wed, Aug 14, 2019 at 3:37 PM Jed Brown wrote: > Mark Adams via petsc-dev writes: > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. > wrote: > > > >> > >> Mark, > >> > >>Would you be able to make one run using single precision? Just single > >> everywhere since that is all we support currently? > >> > >> > > Experience in engineering at least is single does not work for FE > > elasticity. I have tried it many years ago and have heard this from > others. > > This problem is pretty simple other than using Q2. I suppose I could try > > it, but just be aware the FE people might say that single sucks. > > When they say that single sucks, is it for the definition of the > operator or the preconditioner? > Operator. And I've seen GMRES stagnate when using single in communication in parallel Gauss-Seidel. Roundoff is nonlinear. > > As point of reference, we can apply Q2 elasticity operators in double > precision at nearly a billion dofs/second per GPU. > I'm skeptical of big wins in preconditioning (especially setup) due to > the cost and irregularity of indexing being large compared to the > bandwidth cost of the floating point values. >
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Mark Adams via petsc-dev writes: > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. wrote: > >> >> Mark, >> >>Would you be able to make one run using single precision? Just single >> everywhere since that is all we support currently? >> >> > Experience in engineering at least is single does not work for FE > elasticity. I have tried it many years ago and have heard this from others. > This problem is pretty simple other than using Q2. I suppose I could try > it, but just be aware the FE people might say that single sucks. When they say that single sucks, is it for the definition of the operator or the preconditioner? As point of reference, we can apply Q2 elasticity operators in double precision at nearly a billion dofs/second per GPU. I'm skeptical of big wins in preconditioning (especially setup) due to the cost and irregularity of indexing being large compared to the bandwidth cost of the floating point values.
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. wrote: > > Mark, > >Would you be able to make one run using single precision? Just single > everywhere since that is all we support currently? > > Experience in engineering at least is single does not work for FE elasticity. I have tried it many years ago and have heard this from others. This problem is pretty simple other than using Q2. I suppose I could try it, but just be aware the FE people might say that single sucks. >The results will give us motivation (or anti-motivation) to have > support for running KSP (or PC (or Mat) in single precision while the > simulation is double. > >Thanks. > > Barry > > For example if the GPU speed on KSP is a factor of 3 over the double on > GPUs this is serious motivation. > > > > On Aug 14, 2019, at 12:45 PM, Mark Adams wrote: > > > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU > speedup with 98K dof/proc (3D Q2 elasticity). > > > > This is weak scaling of a solve. There is growth in iteration count > folded in here. I should put rtol in the title and/or run a fixed number of > iterations and make it clear in the title. > > > > Comments welcome. > > > > >
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Mark, Would you be able to make one run using single precision? Just single everywhere since that is all we support currently? The results will give us motivation (or anti-motivation) to have support for running KSP (or PC (or Mat) in single precision while the simulation is double. Thanks. Barry For example if the GPU speed on KSP is a factor of 3 over the double on GPUs this is serious motivation. > On Aug 14, 2019, at 12:45 PM, Mark Adams wrote: > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU > speedup with 98K dof/proc (3D Q2 elasticity). > > This is weak scaling of a solve. There is growth in iteration count folded in > here. I should put rtol in the title and/or run a fixed number of iterations > and make it clear in the title. > > Comments welcome. >
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Mark, This is great, we can study these for months. 1) At the top of the plots you say SNES but that can't be right, there is no way it is getting such speed ups for the entire SNES solve since the Jacobians are CPUs and take much of the time. Do you mean the KSP part of the SNES solve? 2) For the case of a bit more than 1000 processes the speedup with GPUs is fantastic, more than 6? 3) People will ask about runs using all 48 CPUs, since they are there it is a little unfair to only compare 24 CPUs with the GPUs. Presumably due to memory bandwidth limits 48 won't be much better than 24 but you need it in your back pocket for completeness. 4) From the table KSPSolve 1 1.0 5.4191e-02 1.0 9.35e+06 7.3 1.3e+04 5.6e+02 8.3e+01 0 0 4 0 3 10 57 97 52 81 19113494114 3.06e-01 129 1.38e-01 84 PCApply 17 1.0 4.5053e-02 1.0 9.22e+06 8.5 1.1e+04 5.6e+02 3.4e+01 0 0 3 0 1 8 49 81 44 33 19684007 98 2.58e-01 113 1.19e-01 81 only 84 percent of the total flops in the KSPSolve are on the GPU and only 81 for the PCApply() where are the rest? MatMult() etc are doing 100 percent on the GPU, MatSolve on the coarsest level should be tiny and not taking 19 percent of the flops? Thanks Barry > On Aug 14, 2019, at 12:45 PM, Mark Adams wrote: > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU > speedup with 98K dof/proc (3D Q2 elasticity). > > This is weak scaling of a solve. There is growth in iteration count folded in > here. I should put rtol in the title and/or run a fixed number of iterations > and make it clear in the title. > > Comments welcome. >
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
> > > 3) Is comparison between pointers appropriate? For example if (dptr != > zarray) { is scary if some arrays are zero length how do we know what the > pointer value will be? > > Yes, you need to consider these cases, which is kind of error prone. Also, I think merging transpose, and not, is a good idea because the way the code is set up it is easy. You just grab a different cached object and keep your rmaps and cmaps straight, I think.
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
My concern is 1) is it actually optimally efficient for all cases? This kind of stuff, IMHO,

  if (yy) {
    if (dptr != zarray) {
      ierr = VecCopy_SeqCUDA(yy,zz);CHKERRQ(ierr);
    } else if (zz != yy) {
      ierr = VecAXPY_SeqCUDA(zz,1.0,yy);CHKERRQ(ierr);
    }
  } else if (dptr != zarray) {
    ierr = VecSet_SeqCUDA(zz,0);CHKERRQ(ierr);
  }

means it is not. It is launching additional kernels and looping over arrays more times than if each form were optimized for its one case. 2) is it utilizing VecCUDAGetArrayWrite() when possible? No, it uses VecCUDAGetArray(), which for certain configurations means copying from the CPU stuff that will immediately be overwritten. Sometimes it can use VecCUDAGetArrayWrite() and sometimes it can't; the code has to handle each case properly. 3) Is comparison between pointers appropriate? For example, if (dptr != zarray) { is scary: if some arrays are zero length, how do we know what the pointer value will be? I am not saying it is totally impossible to have a single routine that optimally and efficiently did all cases: MatMult, yy == zz, etc., but the resulting code will be really complex with lots of if()s and difficult to understand and maintain; just tracing through all cases and ensuring each is optimal is nontrivial. Barry > On Jul 10, 2019, at 11:01 AM, Stefano Zampini > wrote: > > Barry, > > I think having a single code instead of three different, quasi-similar, > versions is less fragile (I admit, once you get the logic correct...) > Also, it conforms with the standard for spmv that implements alpha * A * x + > beta * b > The easiest fix is the following: > > Rename MatMultAdd_ into MatMultKernel_Private and add an extra boolean to > control the transpose operation; > then, you can reuse the same complicated code I have written, just by selecting > the proper cusparse object (matstructT or matstruct) > > > Il giorno mer 10 lug 2019 alle ore 18:16 Smith, Barry F.
> ha scritto: > > In the long run I would like to see smaller specialized chunks of code > (with a bit of duplication between them) instead of highly overloaded > routines like MatMultAdd_AIJCUSPARSE. Better 3 routines, for multiply alone, > for multiply-add alone and for multiply-add with sparse format. Trying to get > all the cases right (performance and correctness) for everything at once is > unnecessary and risky. Having possible zero size objects (and hence null > pointers) doesn't help the complex logic > > > Barry
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Yea, I agree. Once this is working, I'll go back and split MatMultAdd, etc. On Wed, Jul 10, 2019 at 11:16 AM Smith, Barry F. wrote: > > In the long run I would like to see smaller specialized chunks of code > (with a bit of duplication between them) instead of highly overloaded > routines like MatMultAdd_AIJCUSPARSE. Better 3 routines, for multiply > alone, for multiply-add alone and for multiply-add with sparse format. > Trying to get all the cases right (performance and correctness) for > everything at once is unnecessary and risky. Having possible zero size > objects (and hence null pointers) doesn't help the complex logic > > > Barry
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
In the long run I would like to see smaller specialized chunks of code (with a bit of duplication between them) instead of highly overloaded routines like MatMultAdd_AIJCUSPARSE. Better 3 routines, for multiply alone, for multiply-add alone and for multiply-add with sparse format. Trying to get all the cases right (performance and correctness) for everything at once is unnecessary and risky. Having possible zero size objects (and hence null pointers) doesn't help the complex logic Barry > On Jul 10, 2019, at 10:06 AM, Mark Adams wrote: > > Thanks, you made several changes here, including switches with the workvector > size. I guess I should import this logic to the transpose method(s), except > for the yy==NULL branches ... > > MatMult_ calls MatMultAdd with yy=0, but the transpose versions have their own > code. MatMultTranspose_SeqAIJCUSPARSE is very simple. > > Thanks again, > Mark
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
Thanks, you made several changes here, including switches with the workvector size. I guess I should import this logic to the transpose method(s), except for the yy==NULL branches ... MatMult_ calls MatMultAdd with yy=0, but the transpose versions have their own code. MatMultTranspose_SeqAIJCUSPARSE is very simple. Thanks again, Mark On Wed, Jul 10, 2019 at 9:22 AM Stefano Zampini wrote: > Mark, > > if the difference is on lvec, I suspect the bug has to do with compressed > row storage. I have fixed a similar bug in MatMult. > You want to check cusparsestruct->workVector->size() against A->cmap->n. > > Stefano > > Il giorno mer 10 lug 2019 alle ore 15:54 Mark Adams via petsc-dev > ha scritto:
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F. wrote: > > ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr); > if (nt != A->rmap->n) > SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A > (%D) and xx (%D)",A->rmap->n,nt); > ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr); > ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr); > > So the xx on the GPU appears ok? The norm is correct and ... > The a->B appears ok? yes > But on process 1 the result a->lvec is wrong? > yes > How do you look at the a->lvec? Do you copy it to the CPU and print it? > I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented, so I should copy it. Maybe I should make a CUDA version of these methods? > > ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr); > ierr = > VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr); > ierr = > VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr); > ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr); > > Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help? This is where I have been digging around and printing stuff. > > Are you sure the problem isn't related to the "stream business"? > I don't know what that is, but I have played around with adding cudaDeviceSynchronize > > /* This multiplication sequence is a different sequence > than the CPU version. In particular, the diagonal block > multiplication kernel is launched in one stream. Then, > in a separate stream, the data transfers from DeviceToHost > (with MPI messaging in between), then HostToDevice are > launched. Once the data transfer stream is synchronized, > to ensure messaging is complete, the MatMultAdd kernel > is launched in the original (MatMult) stream to protect > against race conditions. > > This sequence should only be called for GPU computation. */ > > Note this comment isn't right and appears to be cut and paste from > somewhere else, since there is no MatMult() nor MatMultAdd kernel here? > Yes, I noticed this. Same as MatMult and not correct here. > > Any way to "turn off the stream business" and see if the result is then > correct? How do you do that? I'm looking at docs on streams but not sure how it's used here. > Perhaps the stream business was done correctly for MatMult() but was never > right for MatMultTranspose()? > > Barry > > BTW: Unrelated comment, the code > > ierr = VecSet(yy,0);CHKERRQ(ierr); > ierr = VecCUDAGetArrayWrite(yy,);CHKERRQ(ierr); > > has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here. > VecCUDAGetArrayWrite() requires that you ignore the values in yy and set > them all yourself, so setting them to zero before calling > VecCUDAGetArrayWrite() does nothing except waste time. > > OK, I'll get rid of it.
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
if (nt != A->rmap->n) SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A (%D) and xx (%D)",A->rmap->n,nt);
ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);

So the xx on the GPU appears ok? The a->B appears ok? But on process 1 the result a->lvec is wrong? How do you look at the a->lvec? Do you copy it to the CPU and print it?

ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
ierr = VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
ierr = VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);

Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help? Are you sure the problem isn't related to the "stream business"?

/* This multiplication sequence is a different sequence
   than the CPU version. In particular, the diagonal block
   multiplication kernel is launched in one stream. Then,
   in a separate stream, the data transfers from DeviceToHost
   (with MPI messaging in between), then HostToDevice are
   launched. Once the data transfer stream is synchronized,
   to ensure messaging is complete, the MatMultAdd kernel
   is launched in the original (MatMult) stream to protect
   against race conditions.

   This sequence should only be called for GPU computation. */

Note this comment isn't right and appears to be cut and paste from somewhere else, since there is no MatMult() nor MatMultAdd kernel here? Any way to "turn off the stream business" and see if the result is then correct? Perhaps the stream business was done correctly for MatMult() but was never right for MatMultTranspose()? Barry BTW: Unrelated comment, the code

ierr = VecSet(yy,0);CHKERRQ(ierr);
ierr = VecCUDAGetArrayWrite(yy,);CHKERRQ(ierr);

has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here.
VecCUDAGetArrayWrite() requires that you ignore the values in yy and set them all yourself, so setting them to zero before calling VecCUDAGetArrayWrite() does nothing except waste time. > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev > wrote: > > I am stumped with this GPU bug(s). Maybe someone has an idea.
Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT
I am stumped with this GPU bug(s). Maybe someone has an idea.

I did find a bug in the cuda transpose mat-vec that cuda-memcheck detected, but I still have differences between the GPU and CPU transpose mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh with two processors. It works on one processor or with cg/none. So it is the transpose mat-vec.

I see that the result of the off-diagonal (a->lvec) is different *only proc 1*. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of mat and vec and printed out matlab vectors. Below is the CPU output and then the GPU with a view of the scatter object, which is identical as you can see.

The matlab B matrix and xx vector are identical. Maybe the GPU copy is wrong ...

The only/first difference between CPU and GPU is a->lvec (the off diagonal contribution) on processor 1. (you can see the norms are *different*). Here is the diff on the process 1 a->lvec vector (all values are off).

Any thoughts would be appreciated, Mark

15:30 1 /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
2,12c2,12
< % type: seqcuda
< Vec_0x53738630_0 = [
< 9.5702137431412879e+00
< 2.1970298791152253e+01
< 4.5422290209190646e+00
< 2.0185031807270226e+00
< 4.2627312508573375e+01
< 1.0889191983882025e+01
< 1.6038202417695462e+01
< 2.7155672033607665e+01
< 6.2540357853223556e+00
---
> % type: seq
> Vec_0x3a546440_0 = [
> 4.5565851251714653e+00
> 1.0460532998971189e+01
> 2.1626531807270220e+00
> 9.6105288923182408e-01
> 2.0295782656035659e+01
> 5.1845791066529463e+00
> 7.6361340020576058e+00
> 1.2929401011659799e+01
> 2.9776812928669392e+00

15:15 130 /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 ./ex56 -cells 2,2,1
[0] 27 global equations, 9 vertices
[0] 27 equations in vector, 9 vertices
0 SNES Function norm 1.223958326481e+02
0 KSP Residual norm 1.223958326481e+02
[0] |x|= 1.223958326481e+02 |a->lvec|= 1.773965489475e+01 |B|= 1.424708937136e+00
[1] |x|= 1.223958326481e+02 |a->lvec|= *2.844171413778e+01* |B|= 1.424708937136e+00
[1] 1) |yy|= 2.007423334680e+02
[0] 1) |yy|= 2.007423334680e+02
[0] 2) |yy|= 1.957605719265e+02
[1] 2) |yy|= 1.957605719265e+02
[1] Number sends = 1; Number to self = 0
[1] 0 length = 9 to whom 0
Now the indices for all remote sends (in order by process sent to)
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] Number receives = 1; Number from self = 0
[1] 0 length 9 from whom 0
Now the indices for all remote receives (in order by process received from)
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
1 KSP Residual norm 8.199932342150e+01
Linear solve did not converge due to DIVERGED_ITS iterations 1
Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0

15:19 /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 ./ex56 -cells 2,2,1 *-ex56_dm_mat_type aijcusparse -ex56_dm_vec_type cuda*
[0] 27 global equations, 9 vertices
[0] 27 equations in vector, 9 vertices
0 SNES Function norm 1.223958326481e+02
0 KSP Residual norm 1.223958326481e+02
[0] |x|= 1.223958326481e+02 |a->lvec|= 1.773965489475e+01 |B|= 1.424708937136e+00
[1] |x|= 1.223958326481e+02 |a->lvec|= *5.973624458725e+01* |B|= 1.424708937136e+00
[0] 1) |yy|= 2.007423334680e+02
[1] 1) |yy|= 2.007423334680e+02
[0] 2) |yy|= 1.953571867633e+02
[1] 2) |yy|= 1.953571867633e+02
[1] Number sends = 1; Number to self = 0
[1] 0 length = 9 to whom 0
Now the indices for all remote sends (in order by process sent to)
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] Number receives = 1; Number from self = 0
[1] 0 length 9 from whom 0
Now the indices for all remote receives (in order by process received from)
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
1 KSP Residual norm 8.199932342150e+01