On Sat, Aug 31, 2019 at 8:04 PM Mark Adams 
<mfad...@lbl.gov<mailto:mfad...@lbl.gov>> wrote:


On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F. 
<bsm...@mcs.anl.gov<mailto:bsm...@mcs.anl.gov>> wrote:

  Any explanation for why the scaling is much better for CPUs than for GPUs? Is 
it the "extra" time needed for communication from the GPUs?

The GPU work is well load balanced, so it weak scales perfectly. When you put 
that work on the CPU it takes longer, so you add more perfectly scalable work 
and the overall scaling looks better. For instance, the 98K dof/proc data goes 
up by about 1/2 sec. from the 1 node to the 512 node case for both GPU and CPU, 
because this non-scaling part is communication, which is the same in both cases.
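
To make the arithmetic concrete, here is a minimal sketch in C. The function 
name weak_efficiency is made up for illustration; only the ~0.5 s growth comes 
from the data. It computes weak-scaling efficiency as the base time divided by 
the base time plus the non-scaling growth.

```c
/* Hypothetical helper for illustration only (not part of PETSc).
 * t_base: time on 1 node, assumed to be perfectly scalable work.
 * growth: extra time at scale from non-scaling communication.
 * Returns the weak-scaling efficiency t_base / (t_base + growth). */
double weak_efficiency(double t_base, double growth)
{
  return t_base / (t_base + growth);
}
```

With assumed base times of 1.0 s on the GPU and 4.0 s on the CPU (picked only 
to mirror the roughly 4x speedup), the same 0.5 s growth gives 
1.0/1.5, about 0.67, for the GPU and 4.0/4.5, about 0.89, for the CPU. The 
communication cost is identical; the CPU just has more scalable work to hide 
it behind.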


  Perhaps you could try the GPU version with Junchao's new CUDA-aware MPI 
branch (in the gitlab merge requests), which can speed up the communication 
from GPUs?

Sure. Do I just check out jczhang/feature-sf-on-gpu and run as usual?

Use jsrun --smpiargs="-gpu" to enable IBM MPI's CUDA-aware support, then add 
-use_gpu_aware_mpi to the options so that PETSc uses that feature.



   Barry


> On Aug 30, 2019, at 11:56 AM, Mark Adams 
> <mfad...@lbl.gov<mailto:mfad...@lbl.gov>> wrote:
>
> Here is some more weak scaling data with a fixed number of iterations (I have 
> given a test with the numerical problems to ORNL and they said they would 
> give it to Nvidia).
>
> I implemented an option to "spread" the reduced coarse grids across the whole 
> machine as opposed to a "compact" layout where active processes are laid out 
> in simple lexicographical order. This spread approach looks a little better.
>
> Mark
>
> On Wed, Aug 14, 2019 at 10:46 PM Smith, Barry F. 
> <bsm...@mcs.anl.gov<mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Ahh, PGI compiler, that explains it :-)
>
>   Ok, thanks. Don't worry about the runs right now. We'll figure out the fix. 
> The code is just
>
>   *a = (PetscReal)strtod(name,endptr);
>
>   so this could be a compiler bug.
>
>
>
>
> > On Aug 14, 2019, at 9:23 PM, Mark Adams 
> > <mfad...@lbl.gov<mailto:mfad...@lbl.gov>> wrote:
> >
> > I am getting this error with single:
> >
> > 22:21  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1 
> > ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type 
> > aijcusparse -fp_trap
> > [0] 81 global equations, 27 vertices
> > [0]PETSC ERROR: *** unknown floating point error occurred ***
> > [0]PETSC ERROR: The specific exception can be determined by running in a 
> > debugger.  When the
> > [0]PETSC ERROR: debugger traps the signal, the exception can be found with 
> > fetestexcept(0x3e000000)
> > [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
> > [0]PETSC ERROR: FE_INVALID=0x20000000 FE_DIVBYZERO=0x4000000 
> > FE_OVERFLOW=0x10000000 FE_UNDERFLOW=0x8000000 FE_INEXACT=0x2000000
> > [0]PETSC ERROR: Try option -start_in_debugger
> > [0]PETSC ERROR: likely location of problem given in stack below
> > [0]PETSC ERROR: ---------------------  Stack Frames 
> > ------------------------------------
> > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> > [0]PETSC ERROR:       INSTEAD the line number of the start of the function
> > [0]PETSC ERROR:       is given.
> > [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
> > [0]PETSC ERROR: [0] PetscStrtod line 1964 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > [0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > [0]PETSC ERROR: [0] PetscOptionsGetReal line 2321 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > [0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
> > [0]PETSC ERROR: [0] KSPSetFromOptions line 329 
> > /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
> > [0]PETSC ERROR: [0] SNESSetFromOptions line 869 
> > /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
> > [0]PETSC ERROR: --------------------- Error Message 
> > --------------------------------------------------------------
> > [0]PETSC ERROR: Floating point exception
> > [0]PETSC ERROR: trapped floating point error
> > [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html 
> > for trouble shooting.
> > [0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1685-gd3eb2e1  GIT 
> > Date: 2019-08-13 06:33:29 -0400
> > [0]PETSC ERROR: ./ex56_single on a arch-summit-dbg-single-pgi-cuda named 
> > h36n11 by adams Wed Aug 14 22:21:56 2019
> > [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpiCC 
> > --with-fc=mpif90 COPTFLAGS="-g -Mfcon" CXXOPTFLAGS="-g -Mfcon" 
> > FOPTFLAGS="-g -Mfcon" --with-precision=single --with-ssl=0 --with-batch=0 
> > --with-mpiexec="jsrun -g 1" --with-cuda=1 --with-cudac=nvcc 
> > CUDAFLAGS="-ccbin pgc++" --download-metis --download-parmetis 
> > --download-fblaslapack --with-x=0 --with-64-bit-indices=0 
> > --with-debugging=1 PETSC_ARCH=arch-summit-dbg-single-pgi-cuda
> > [0]PETSC ERROR: #1 User provided function() line 0 in Unknown file
> > --------------------------------------------------------------------------
> >
> > On Wed, Aug 14, 2019 at 9:51 PM Smith, Barry F. 
> > <bsm...@mcs.anl.gov<mailto:bsm...@mcs.anl.gov>> wrote:
> >
> >   Oh, doesn't even have to be that large. We just need to be able to look 
> > at the flop rates (as a surrogate for run times) and compare with the 
> > previous runs. So long as the size per process is pretty much the same that 
> > is good enough.
> >
> >    Barry
> >
> >
> > > On Aug 14, 2019, at 8:45 PM, Mark Adams 
> > > <mfad...@lbl.gov<mailto:mfad...@lbl.gov>> wrote:
> > >
> > > I can run single, I just can't scale up. But I can use like 1500 
> > > processors.
> > >
> > > On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F. 
> > > <bsm...@mcs.anl.gov<mailto:bsm...@mcs.anl.gov>> wrote:
> > >
> > >   Oh, are all your integers 8 bytes? Even on one node?
> > >
> > >   Once Karl's new middleware is in place we should see about reducing to 
> > > 4 bytes on the GPU.
> > >
> > >    Barry
> > >
> > >
> > > > On Aug 14, 2019, at 7:44 PM, Mark Adams 
> > > > <mfad...@lbl.gov<mailto:mfad...@lbl.gov>> wrote:
> > > >
> > > > > OK, I'll run single. It is a bit perverse to run with 4 byte floats and 
> > > > > 8 byte integers ... I could use 32 bit ints and just not scale out.
> > > >
> > > > On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F. 
> > > > <bsm...@mcs.anl.gov<mailto:bsm...@mcs.anl.gov>> wrote:
> > > >
> > > >  Mark,
> > > >
> > > >    Oh, I don't even care if it converges, just put in a fixed number of 
> > > > iterations. The idea is to just get a baseline of the possible 
> > > > improvement.
> > > >
> > > >     ECP is literally dropping millions into research on "multi 
> > > > precision" computations on GPUs, we need to have some actual numbers 
> > > > for the best potential benefit to determine how much we invest in 
> > > > further investigating it, or not.
> > > >
> > > >     I am not expressing any opinions on the approach, we are just in 
> > > > the fact gathering stage.
> > > >
> > > >
> > > >    Barry
> > > >
> > > >
> > > > > On Aug 14, 2019, at 2:27 PM, Mark Adams 
> > > > > <mfad...@lbl.gov<mailto:mfad...@lbl.gov>> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. 
> > > > > <bsm...@mcs.anl.gov<mailto:bsm...@mcs.anl.gov>> wrote:
> > > > >
> > > > >   Mark,
> > > > >
> > > > >    Would you be able to make one run using single precision? Just 
> > > > > single everywhere since that is all we support currently?
> > > > >
> > > > >
> > > > > > Experience, in engineering at least, is that single does not work for FE 
> > > > > > elasticity. I tried it many years ago and have heard the same from 
> > > > > > others. This problem is pretty simple other than using Q2. I suppose 
> > > > > > I could try it, but just be aware the FE people might say that single 
> > > > > > sucks.
> > > > >
> > > > > >    The results will give us motivation (or anti-motivation) to have 
> > > > > > support for running KSP (or PC, or Mat) in single precision while 
> > > > > > the simulation is double.
> > > > >
> > > > >    Thanks.
> > > > >
> > > > >      Barry
> > > > >
> > > > > > For example, if single-precision KSP on the GPU is a factor of 3 faster 
> > > > > > than double on GPUs, this is serious motivation.
> > > > >
> > > > >
> > > > > > On Aug 14, 2019, at 12:45 PM, Mark Adams 
> > > > > > <mfad...@lbl.gov<mailto:mfad...@lbl.gov>> wrote:
> > > > > >
> > > > > > > FYI, here is some scaling data for GAMG on SUMMIT. Getting about 4x 
> > > > > > > GPU speedup with 98K dof/proc (3D Q2 elasticity).
> > > > > >
> > > > > > This is weak scaling of a solve. There is growth in iteration count 
> > > > > > folded in here. I should put rtol in the title and/or run a fixed 
> > > > > > number of iterations and make it clear in the title.
> > > > > >
> > > > > > Comments welcome.
> > > > > > <out_cpu_012288><out_cpu_001536><out_cuda_012288><out_cpu_000024><out_cpu_000192><out_cuda_001536><out_cuda_000192><out_cuda_000024><weak_scaling_cpu.png><weak_scaling_cuda.png>
> > > > >
> > > >
> > >
> >
>
> <weak_scaling_gpu_compact_spread.png><weak_scaling_cpu.png><spread.tar><compact.tar>
