Junchao and Barry,

I am using mark/fix-cuda-with-gamg-pintocpu, which is built on Barry's robustify branch. Is this in master yet? If so, I'd like to get my branch merged to master, then merge Junchao's branch, and then use it.
I think we were waiting for some refactoring from Karl. Anyway, I'm not sure how to proceed.

Thanks,
Mark

On Sun, Sep 1, 2019 at 8:45 AM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>
> On Sat, Aug 31, 2019 at 8:04 PM Mark Adams <mfad...@lbl.gov> wrote:
>>
>> On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>>
>>> Any explanation for why the scaling is much better for CPUs than for GPUs? Is it the "extra" time needed for communication from the GPUs?
>>>
>> The GPU work is well load balanced, so it weak scales perfectly. When you put that work on the CPU you add more perfectly scalable work, so it looks better. For instance, the 98K dof/proc data goes up by about 1/2 sec. from the 1 node case to the 512 node case for both GPU and CPU, because this non-scaling comes from communication that is the same in both cases.
>>
>>> Perhaps you could try the GPU version with Junchao's new MPI-aware CUDA branch (in the gitlab merge requests) that can speed up the communication from GPUs?
>>>
>> Sure. Do I just check out jczhang/feature-sf-on-gpu and run as usual?
>>
> Use jsrun --smpiargs="-gpu" to enable IBM MPI's cuda-aware support, then add the -use_gpu_aware_mpi option to let PETSc use that feature.
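A sketch of what the combined launch might look like (the executable, resource layout, and solver options here are illustrative, borrowed from the ex56 runs quoted later in this thread; only --smpiargs="-gpu" and -use_gpu_aware_mpi come from Junchao's instructions):

    jsrun --smpiargs="-gpu" -n 24 -a 1 -c 1 -g 1 ./ex56 \
        -ex56_dm_vec_type cuda -ex56_dm_mat_type aijcusparse \
        -use_gpu_aware_mpi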
>
>>> Barry
>>>
>>> > On Aug 30, 2019, at 11:56 AM, Mark Adams <mfad...@lbl.gov> wrote:
>>> >
>>> > Here is some more weak scaling data with a fixed number of iterations (I have given a test with the numerical problems to ORNL and they said they would give it to Nvidia).
>>> >
>>> > I implemented an option to "spread" the reduced coarse grids across the whole machine, as opposed to a "compact" layout where the active processes are laid out in simple lexicographical order. This spread approach looks a little better.
>>> >
>>> > Mark
>>> >
>>> > On Wed, Aug 14, 2019 at 10:46 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>> >
>>> > Ahh, PGI compiler, that explains it :-)
>>> >
>>> > Ok, thanks. Don't worry about the runs right now. We'll figure out the fix. The code is just
>>> >
>>> >     *a = (PetscReal)strtod(name,endptr);
>>> >
>>> > could be a compiler bug.
>>> >
>>> > > On Aug 14, 2019, at 9:23 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>> > >
>>> > > I am getting this error with single:
>>> > >
>>> > > 22:21 /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1 ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type aijcusparse -fp_trap
>>> > > [0] 81 global equations, 27 vertices
>>> > > [0]PETSC ERROR: *** unknown floating point error occurred ***
>>> > > [0]PETSC ERROR: The specific exception can be determined by running in a debugger. When the
>>> > > [0]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3e000000)
>>> > > [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
>>> > > [0]PETSC ERROR: FE_INVALID=0x20000000 FE_DIVBYZERO=0x4000000 FE_OVERFLOW=0x10000000 FE_UNDERFLOW=0x8000000 FE_INEXACT=0x2000000
>>> > > [0]PETSC ERROR: Try option -start_in_debugger
>>> > > [0]PETSC ERROR: likely location of problem given in stack below
>>> > > [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>> > > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>> > > [0]PETSC ERROR:       INSTEAD the line number of the start of the function is given.
>>> > > [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355 /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
>>> > > [0]PETSC ERROR: [0] PetscStrtod line 1964 /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
>>> > > [0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021 /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
>>> > > [0]PETSC ERROR: [0] PetscOptionsGetReal line 2321 /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
>>> > > [0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015 /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
>>> > > [0]PETSC ERROR: [0] KSPSetFromOptions line 329 /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
>>> > > [0]PETSC ERROR: [0] SNESSetFromOptions line 869 /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
>>> > > [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>> > > [0]PETSC ERROR: Floating point exception
>>> > > [0]PETSC ERROR: trapped floating point error
>>> > > [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>> > > [0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1685-gd3eb2e1  GIT Date: 2019-08-13 06:33:29 -0400
>>> > > [0]PETSC ERROR: ./ex56_single on a arch-summit-dbg-single-pgi-cuda named h36n11 by adams Wed Aug 14 22:21:56 2019
>>> > > [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpiCC --with-fc=mpif90 COPTFLAGS="-g -Mfcon" CXXOPTFLAGS="-g -Mfcon" FOPTFLAGS="-g -Mfcon" --with-precision=single --with-ssl=0 --with-batch=0 --with-mpiexec="jsrun -g 1" --with-cuda=1 --with-cudac=nvcc CUDAFLAGS="-ccbin pgc++" --download-metis --download-parmetis --download-fblaslapack --with-x=0 --with-64-bit-indices=0 --with-debugging=1 PETSC_ARCH=arch-summit-dbg-single-pgi-cuda
>>> > > [0]PETSC ERROR: #1 User provided function() line 0 in Unknown file
>>> > > --------------------------------------------------------------------------
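The stack above ends in PetscStrtod, and in a --with-precision=single build the line Barry quotes narrows strtod's double result to a float. A minimal standalone sketch of that suspected failure mode (illustrative C, not PETSc source; that the PGI/POWER trap mask really includes FE_INEXACT, as the 0x3e000000 flag list suggests, is an assumption here):

    #include <fenv.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch of the conversion PetscStrtod performs when PetscReal is float:
     *     *a = (PetscReal)strtod(name,endptr);
     * For most decimal strings both strtod() and the double->float narrowing
     * are inexact, so a trap mask that includes FE_INEXACT would fire here
     * even though nothing is numerically wrong.
     * Build with e.g.: cc -std=c99 fptrap_sketch.c -lm */
    int main(void)
    {
      const char     *name = "0.1"; /* not exactly representable in binary */
      char           *endptr;
      volatile float  a;            /* volatile keeps the cast at run time */

      feclearexcept(FE_ALL_EXCEPT);
      a = (float)strtod(name, &endptr);
      if (fetestexcept(FE_INEXACT))
        printf("FE_INEXACT set after strtod + narrowing cast (a = %g)\n", a);
      return 0;
    }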
>>> > >
>>> > > On Wed, Aug 14, 2019 at 9:51 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>> > >
>>> > > Oh, it doesn't even have to be that large. We just need to be able to look at the flop rates (as a surrogate for run times) and compare with the previous runs. So long as the size per process is pretty much the same, that is good enough.
>>> > >
>>> > > Barry
>>> > >
>>> > > > On Aug 14, 2019, at 8:45 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>> > > >
>>> > > > I can run single, I just can't scale up. But I can use like 1500 processors.
>>> > > >
>>> > > > On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>> > > >
>>> > > > Oh, are all your integers 8 bytes? Even on one node?
>>> > > >
>>> > > > Once Karl's new middleware is in place we should see about reducing to 4 bytes on the GPU.
>>> > > >
>>> > > > Barry
>>> > > >
>>> > > > > On Aug 14, 2019, at 7:44 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>> > > > >
>>> > > > > OK, I'll run single. It's a bit perverse to run with 4-byte floats and 8-byte integers ... I could use 32-bit ints and just not scale out.
>>> > > > >
>>> > > > > On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>> > > > >
>>> > > > > Mark,
>>> > > > >
>>> > > > > Oh, I don't even care if it converges, just put in a fixed number of iterations. The idea is to just get a baseline of the possible improvement.
>>> > > > >
>>> > > > > ECP is literally dropping millions into research on "multi precision" computations on GPUs; we need some actual numbers for the best potential benefit to determine how much we invest in investigating it further, or not.
>>> > > > >
>>> > > > > I am not expressing any opinions on the approach, we are just in the fact-gathering stage.
>>> > > > >
>>> > > > > Barry
>>> > > > >
>>> > > > > > On Aug 14, 2019, at 2:27 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>> > > > > >
>>> > > > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>> > > > > >
>>> > > > > > Mark,
>>> > > > > >
>>> > > > > > Would you be able to make one run using single precision? Just single everywhere, since that is all we support currently?
>>> > > > > >
>>> > > > > > Experience in engineering, at least, is that single does not work for FE elasticity. I tried it many years ago and have heard the same from others. This problem is pretty simple other than using Q2. I suppose I could try it, but just be aware the FE people might say that single sucks.
>>> > > > > >
>>> > > > > > The results will give us motivation (or anti-motivation) to have support for running KSP (or PC, or Mat) in single precision while the simulation is double.
>>> > > > > >
>>> > > > > > Thanks.
>>> > > > > >
>>> > > > > > Barry
>>> > > > > >
>>> > > > > > For example, if the GPU speed on KSP is a factor of 3 over double on GPUs, this is serious motivation.
>>> > > > > >
>>> > > > > > > On Aug 14, 2019, at 12:45 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>> > > > > > >
>>> > > > > > > FYI, here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU speedup with 98K dof/proc (3D Q2 elasticity).
>>> > > > > > >
>>> > > > > > > This is weak scaling of a solve. There is growth in iteration count folded in here. I should put rtol in the title and/or run a fixed number of iterations and make that clear in the title.
>>> > > > > > >
>>> > > > > > > Comments welcome.
>>> > > > > > >
>>> > > > > > > [Attachments: out_cpu_012288, out_cpu_001536, out_cuda_012288, out_cpu_000024, out_cpu_000192, out_cuda_001536, out_cuda_000192, out_cuda_000024, weak_scaling_cpu.png, weak_scaling_cuda.png]
>>>
>>> [Attachments: weak_scaling_gpu_compact_spread.png, weak_scaling_cpu.png, spread.tar, compact.tar]
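Both Barry's baseline request and Mark's note about the plot titles come down to running a fixed number of iterations. For reference, a sketch of standard KSP options that would force this (the count of 30 is arbitrary; these are generic PETSc controls, not options taken from the runs in this thread):

    -ksp_max_it 30 -ksp_convergence_test skip

With the skip convergence test the solver ignores the tolerances and always runs -ksp_max_it iterations, which removes iteration-count growth from a weak-scaling comparison.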