Thanks Karl and Matt. I thought I had created all vectors with the CUSP type, but I'll double-check. I was trying to find vectors that I may have accidentally set up without the CUSP type, perhaps through some interface with the SNES solver.
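
For what it's worth, below is roughly the kind of check I'm planning to drop into the code to catch stray CPU-side vectors. Just a sketch: the helper is my own, and I'm assuming the VecGetType() interface and the "seqcusp" type name from petsc-3.3 built with --with-cusp=1.

    /* Throwaway helper (mine, not PETSc API): warn if a Vec does not have the
       CUSP type, so stray CPU-side vectors are easier to spot.  Assumes
       petsc-3.3 with --with-cusp=1; "seqcusp" should match VECSEQCUSP. */
    #include <string.h>
    #include <petscvec.h>

    static PetscErrorCode CheckVecIsCusp(Vec v, const char *label)
    {
      VecType        type;   /* const char* in newer PETSc, char* in some older releases */
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = VecGetType(v, &type);CHKERRQ(ierr);
      if (!type || strcmp(type, "seqcusp")) {
        ierr = PetscPrintf(PETSC_COMM_SELF, "Vec %s has type %s, expected seqcusp\n",
                           label, type ? type : "(not set)");CHKERRQ(ierr);
      }
      PetscFunctionReturn(0);
    }

The plan is to call this on the vectors I create myself as well as on the ones SNES hands to my function/Jacobian callbacks, plus anything I VecDuplicate() from them.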
I'll also double-check with the CUDA examples as Karl suggested. There are 6 Tesla M2070s on this box, but I'm only running on one of them.

On Sat, Nov 17, 2012 at 2:42 PM, Matthew Knepley <knepley at gmail.com> wrote:
> On Sat, Nov 17, 2012 at 3:05 PM, David Fuentes <fuentesdt at gmail.com> wrote:
> > Thanks Jed.
> > I was trying to run it in dbg mode to verify if all significant parts of the
> > solver were running on the GPU and not on the CPU by mistake.
> > I can't pinpoint what part of the solver is running on the CPU. When I run
> > top while running the solver there seems to be ~800% CPU utilization
> > that I wasn't expecting. I can't tell if I'm slowing things down by
> > transferring between CPU/GPU on accident?
>
> 1) I am not sure what you mean by 800%, but it is definitely
> legitimate to want to know where you are computing.
>
> 2) At least some computation is happening on the GPU. I can tell this from the
> Vec/MatCopyToGPU events.
>
> 3) Your flop rates are not great. The MatMult is about half what we get on the Tesla, but you
> could have another card without good support for double precision. The vector ops however
> are pretty bad.
>
> 4) It looks like half the flops are in MatMult, which is definitely on the card, and the others are in
> vector operations. Do you create any other vectors without the CUSP type?
>
>    Matt
>
> > thanks again,
> > df
> >
> > On Sat, Nov 17, 2012 at 1:49 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> >>
> >> Please read the large boxed message about debugging mode.
> >>
> >> (Replying from phone so can't make it 72 point blinking red, sorry.)
> >>
> >> On Nov 17, 2012 1:41 PM, "David Fuentes" <fuentesdt at gmail.com> wrote:
> >>>
> >>> thanks Matt,
> >>>
> >>> My log summary is below.
> >>>
> >>> ************************************************************************************************************************
> >>> ***          WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document               ***
> >>> ************************************************************************************************************************
> >>>
> >>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> >>>
> >>> ./FocusUltraSoundModel on a gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg named SCRGP2 with 1 processor, by fuentes Sat Nov 17 13:35:06 2012
> >>> Using Petsc Release Version 3.3.0, Patch 4, Fri Oct 26 10:46:51 CDT 2012
> >>>
> >>>                          Max       Max/Min        Avg      Total
> >>> Time (sec):           3.164e+01      1.00000   3.164e+01
> >>> Objects:              4.100e+01      1.00000   4.100e+01
> >>> Flops:                2.561e+09      1.00000   2.561e+09  2.561e+09
> >>> Flops/sec:            8.097e+07      1.00000   8.097e+07  8.097e+07
> >>> Memory:               2.129e+08      1.00000              2.129e+08
> >>> MPI Messages:         0.000e+00      0.00000    0.000e+00  0.000e+00
> >>> MPI Message Lengths:  0.000e+00      0.00000    0.000e+00  0.000e+00
> >>> MPI Reductions:       4.230e+02      1.00000
> >>>
> >>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
> >>>                             e.g., VecAXPY() for real vectors of length N --> 2N flops
> >>>                             and VecAXPY() for complex vectors of length N --> 8N flops
> >>>
> >>> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
> >>>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
> >>>  0:      Main Stage: 3.1636e+01 100.0%  2.5615e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  4.220e+02  99.8%
> >>>
> >>> ------------------------------------------------------------------------------------------------------------------------
> >>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> >>> Phase summary info:
> >>>    Count: number of times phase was executed
> >>>    Time and Flops: Max - maximum over all processors
> >>>                    Ratio - ratio of maximum to minimum over all processors
> >>>    Mess: number of messages sent
> >>>    Avg. len: average message length
> >>>    Reduct: number of global reductions
> >>>    Global: entire computation
> >>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
> >>>       %T - percent time in this phase         %f - percent flops in this phase
> >>>       %M - percent messages in this phase     %L - percent message lengths in this phase
> >>>       %R - percent reductions in this phase
> >>>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> >>> ------------------------------------------------------------------------------------------------------------------------
> >>>
> >>>       ##########################################################
> >>>       #                                                        #
> >>>       #                       WARNING!!!                       #
> >>>       #                                                        #
> >>>       #   This code was compiled with a debugging option,      #
> >>>       #   To get timing results run ./configure                #
> >>>       #   using --with-debugging=no, the performance will      #
> >>>       #   be generally two or three times faster.              #
> >>>       #                                                        #
> >>>       ##########################################################
> >>>
> >>> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
> >>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
> >>> ------------------------------------------------------------------------------------------------------------------------
> >>>
> >>> --- Event Stage 0: Main Stage
> >>>
> >>> ComputeFunction       52 1.0 3.9104e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  1  0  0  0  1   1  0  0  0  1     0
> >>> VecDot                50 1.0 3.2072e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  3025
> >>> VecMDot               50 1.0 1.3100e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   741
> >>> VecNorm              200 1.0 9.7943e-02 1.0 3.88e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 15  0  0  0   0 15  0  0  0  3963
> >>> VecScale             100 1.0 1.3496e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   719
> >>> VecCopy              150 1.0 4.8405e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> >>> VecSet               164 1.0 2.9707e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> >>> VecAXPY               50 1.0 3.2194e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  3014
> >>> VecWAXPY              50 1.0 2.9040e-01 1.0 4.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0   167
> >>> VecMAXPY             100 1.0 5.4555e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  8  0  0  0   2  8  0  0  0   356
> >>> VecPointwiseMult     100 1.0 5.3003e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  2  4  0  0  0   2  4  0  0  0   183
> >>> VecScatterBegin       53 1.0 1.8660e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> >>> VecReduceArith       101 1.0 6.9973e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0  2801
> >>> VecReduceComm         51 1.0 1.0252e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> >>> VecNormalize         100 1.0 1.8565e-01 1.0 2.91e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   1 11  0  0  0  1568
> >>> VecCUSPCopyTo        152 1.0 5.8016e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> >>> VecCUSPCopyFrom      201 1.0 6.0029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> >>> MatMult              100 1.0 6.8465e-01 1.0 1.25e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2 49  0  0  0   2 49  0  0  0  1825
> >>> MatAssemblyBegin       3 1.0 3.3379e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> >>> MatAssemblyEnd         3 1.0 2.7767e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> >>> MatZeroEntries         1 1.0 2.0346e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> >>> MatCUSPCopyTo          3 1.0 1.4056e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> >>> SNESSolve              1 1.0 2.2094e+01 1.0 2.56e+09 1.0 0.0e+00 0.0e+00 3.7e+02 70100  0  0 88  70100  0  0 89   116
> >>> SNESFunctionEval      51 1.0 3.9031e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> >>> SNESJacobianEval      50 1.0 1.3191e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
> >>> SNESLineSearch        50 1.0 6.2922e+00 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 5.0e+01 20 45  0  0 12  20 45  0  0 12   184
> >>> KSPGMRESOrthog        50 1.0 4.0436e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 5.0e+01  1  8  0  0 12   1  8  0  0 12   480
> >>> KSPSetUp              50 1.0 2.1935e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+01  0  0  0  0  4   0  0  0  0  4     0
> >>> KSPSolve              50 1.0 1.3230e+01 1.0 1.40e+09 1.0 0.0e+00 0.0e+00 3.2e+02 42 55  0  0 75  42 55  0  0 75   106
> >>> PCSetUp               50 1.0 1.9897e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.9e+01  6  0  0  0 12   6  0  0  0 12     0
> >>> PCApply              100 1.0 5.7457e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 4.0e+00  2  4  0  0  1   2  4  0  0  1   169
> >>> ------------------------------------------------------------------------------------------------------------------------
> >>>
> >>> Memory usage is given in bytes:
> >>>
> >>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> >>> Reports information only for process 0.
> >>>
> >>> --- Event Stage 0: Main Stage
> >>>
> >>>            Container     2              2         1096     0
> >>>               Vector    16             16    108696592     0
> >>>       Vector Scatter     2              2         1240     0
> >>>               Matrix     1              1     96326824     0
> >>>     Distributed Mesh     3              3      7775936     0
> >>>      Bipartite Graph     6              6         4104     0
> >>>            Index Set     5              5      3884908     0
> >>>    IS L to G Mapping     1              1      3881760     0
> >>>                 SNES     1              1         1268     0
> >>>       SNESLineSearch     1              1          840     0
> >>>               Viewer     1              0            0     0
> >>>        Krylov Solver     1              1        18288     0
> >>>       Preconditioner     1              1          792     0
> >>> ========================================================================================================================
> >>> Average time to get PetscTime(): 9.53674e-08
> >>> #PETSc Option Table entries:
> >>> -da_vec_type cusp
> >>> -dm_mat_type seqaijcusp
> >>> -ksp_monitor
> >>> -log_summary
> >>> -pc_type jacobi
> >>> -snes_converged_reason
> >>> -snes_monitor
> >>> #End of PETSc Option Table entries
> >>> Compiled without FORTRAN kernels
> >>> Compiled with full precision matrices (default)
> >>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> >>> Configure run at: Fri Nov 16 08:40:52 2012
> >>> Configure options: --with-clanguage=C++ --with-mpi-dir=/usr --with-shared-libraries --with-cuda-arch=sm_20
> >>> --CFLAGS=-O0 --CXXFLAGS=-O0 --CUDAFLAGS=-O0 --with-etags=1 --with-mpi4py=0
> >>> --with-blas-lapack-lib="[/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_rt.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_intel_thread.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_core.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libiomp5.so]"
> >>> --download-blacs --download-superlu_dist --download-triangle --download-parmetis --download-metis --download-mumps
> >>> --download-scalapack --with-cuda=1 --with-cusp=1 --with-thrust=1 --with-cuda-dir=/opt/apps/cuda/4.2//cuda --with-sieve=1
> >>> --download-exodusii=yes --download-netcdf --with-boost=1 --with-boost-dir=/usr --download-fiat=yes --download-generator
> >>> --download-scientificpython --with-matlab=1 --with-matlab-engine=1 --with-matlab-dir=/opt/MATLAB/R2011a
> >>> -----------------------------------------
> >>> Libraries compiled on Fri Nov 16 08:40:52 2012 on SCRGP2
> >>> Machine characteristics: Linux-2.6.32-41-server-x86_64-with-debian-squeeze-sid
> >>> Using PETSc directory: /opt/apps/PETSC/petsc-3.3-p4
> >>> Using PETSc arch: gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg
> >>> -----------------------------------------
> >>>
> >>> Using C compiler: /usr/bin/mpicxx -O0 -g -fPIC ${COPTFLAGS} ${CFLAGS}
> >>> Using Fortran compiler: /usr/bin/mpif90 -fPIC -Wall -Wno-unused-variable -g ${FOPTFLAGS} ${FFLAGS}
> >>> -----------------------------------------
> >>>
> >>> Using include paths:
> >>> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include
> >>> -I/opt/apps/PETSC/petsc-3.3-p4/include
> >>> -I/opt/apps/PETSC/petsc-3.3-p4/include
> >>> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include
> >>> -I/opt/apps/cuda/4.2//cuda/include
> >>> -I/opt/apps/PETSC/petsc-3.3-p4/include/sieve
> >>> -I/opt/MATLAB/R2011a/extern/include -I/usr/include
> >>> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/cbind/include
> >>> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/forbind/include
> >>> -I/usr/include/mpich2
> >>> -----------------------------------------
> >>>
> >>> Using C linker: /usr/bin/mpicxx
> >>> Using Fortran linker: /usr/bin/mpif90
> >>> Using libraries:
> >>> -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
> >>> -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib -lpetsc
> >>> -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
> >>> -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
> >>> -ltriangle -lX11 -lpthread -lsuperlu_dist_3.1 -lcmumps -ldmumps -lsmumps
> >>> -lzmumps -lmumps_common -lpord -lparmetis -lmetis -lscalapack -lblacs
> >>> -Wl,-rpath,/opt/apps/cuda/4.2//cuda/lib64 -L/opt/apps/cuda/4.2//cuda/lib64
> >>> -lcufft -lcublas -lcudart -lcusparse
> >>> -Wl,-rpath,/opt/MATLAB/R2011a/sys/os/glnxa64:/opt/MATLAB/R2011a/bin/glnxa64:/opt/MATLAB/R2011a/extern/lib/glnxa64
> >>> -L/opt/MATLAB/R2011a/bin/glnxa64 -L/opt/MATLAB/R2011a/extern/lib/glnxa64
> >>> -leng -lmex -lmx -lmat -lut -licudata -licui18n -licuuc
> >>> -Wl,-rpath,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib
> >>> -L/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib -lmkl_rt -lmkl_intel_thread
> >>> -lmkl_core -liomp5 -lexoIIv2for -lexodus -lnetcdf_c++ -lnetcdf
> >>> -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.4.3
> >>> -L/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -lmpichf90 -lgfortran -lm -lm
> >>> -lmpichcxx -lstdc++ -lmpichcxx -lstdc++ -ldl -lmpich -lopa -lpthread -lrt
> >>> -lgcc_s -ldl
> >>> -----------------------------------------
> >>>
> >>>
> >>> On Sat, Nov 17, 2012 at 11:02 AM, Matthew Knepley <knepley at gmail.com> wrote:
> >>>>
> >>>> On Sat, Nov 17, 2012 at 10:50 AM, David Fuentes <fuentesdt at gmail.com> wrote:
> >>>> > Hi,
> >>>> >
> >>>> > I'm using petsc 3.3p4
> >>>> > I'm trying to run a nonlinear SNES solver on GPU with gmres and jacobi PC
> >>>> > using VECSEQCUSP and MATSEQAIJCUSP datatypes for the rhs and jacobian matrix
> >>>> > respectively.
> >>>> > When running top I still see significant CPU utilization (800-900 %CPU)
> >>>> > during the solve ? possibly from some multithreaded operations ?
> >>>> >
> >>>> > Is this expected ?
> >>>> > I was thinking that since I input everything into the solver as a CUSP
> >>>> > datatype, all linear algebra operations would be on the GPU device from
> >>>> > there and wasn't expecting to see such CPU utilization during the solve ?
> >>>> > Do I probably have an error in my code somewhere ?
> >>>>
> >>>> We cannot answer performance questions without -log_summary
> >>>>
> >>>>    Matt
> >>>>
> >>>> > Thanks,
> >>>> > David
> >>>>
> >>>> --
> >>>> What most experimenters take for granted before they begin their
> >>>> experiments is infinitely more interesting than any results to which
> >>>> their experiments lead.
> >>>> -- Norbert Wiener
> >>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener
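
P.S. To be sure the run really is on only one of the six cards, I'm also going to print the active CUDA device from a tiny standalone program before digging further. This is a rough sketch using plain CUDA runtime calls only, nothing PETSc-specific:

    /* Quick standalone sanity check: how many CUDA devices are visible and
       which one the current host thread is bound to.  Plain CUDA runtime API
       (CUDA 4.2 here); compile with nvcc or link against -lcudart.
       Verification aid only, not part of the solver code. */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
      int ndev = 0, dev = -1;
      struct cudaDeviceProp prop;

      cudaGetDeviceCount(&ndev);
      cudaGetDevice(&dev);                  /* device the current context uses */
      cudaGetDeviceProperties(&prop, dev);
      printf("%d device(s) visible, active device %d: %s (compute capability %d.%d)\n",
             ndev, dev, prop.name, prop.major, prop.minor);
      return 0;
    }

If I remember right, CUDA_VISIBLE_DEVICES can also be used to restrict which of the M2070s the process can see, which should make the comparison cleaner.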