Thanks Karl, Matt,

I thought I had created all vectors with the CUSP type; I'll double-check.
I was trying to find vectors that I may have accidentally set up without the
CUSP type, perhaps somewhere in my interface with the SNES solver.

I'll also double-check against the CUDA examples, as Karl suggested.
There are six Tesla M2070 cards on this box, but I'm only running on one of them.
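
As a rough cross-check of the rates discussed in the thread, here is a minimal
sketch (plain Python, not PETSc code) that recomputes Mflop/s from the time and
flop columns of a few events in the log quoted below. The 500 Mflop/s cutoff is
an arbitrary value chosen only for illustration, not a PETSc threshold, and the
recomputed rates can differ from the log by a unit or so because the log prints
rounded flop counts.

```python
# (event, time in seconds, flops) copied from the quoted -log_summary table
events = [
    ("VecDot",           3.2072e-02, 9.70e+07),
    ("VecWAXPY",         2.9040e-01, 4.85e+07),
    ("VecMAXPY",         5.4555e-01, 1.94e+08),
    ("VecPointwiseMult", 5.3003e-01, 9.70e+07),
    ("MatMult",          6.8465e-01, 1.25e+09),
]

def mflops(flops, seconds):
    """Rate in Mflop/s: 1e-6 * flops / time."""
    return 1e-6 * flops / seconds

rates = {name: mflops(flops, seconds) for name, seconds, flops in events}

# Hypothetical cutoff for illustration only; pick whatever suits the hardware.
SLOW_CUTOFF = 500.0
slow = sorted(name for name, rate in rates.items() if rate < SLOW_CUTOFF)

for name, rate in rates.items():
    flag = "  <-- slow" if name in slow else ""
    print(f"{name:18s} {rate:8.0f} Mflop/s{flag}")
```

With these numbers the vector operations that fall under the cutoff are
VecWAXPY, VecMAXPY, and VecPointwiseMult, which are the ones I want to check
for an unintended CPU/GPU copy.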



On Sat, Nov 17, 2012 at 2:42 PM, Matthew Knepley <knepley at gmail.com> wrote:

> On Sat, Nov 17, 2012 at 3:05 PM, David Fuentes <fuentesdt at gmail.com>
> wrote:
> > Thanks Jed.
> > I was trying to run it in dbg mode to verify that all significant parts of the
> > solver were running on the GPU and not on the CPU by mistake.
> > I can't pinpoint which part of the solver is running on the CPU. When I run
> > top while running the solver, there seems to be ~800% CPU utilization
> > that I wasn't expecting. I can't tell whether I'm slowing things down by
> > accidentally transferring data between the CPU and GPU.
>
> 1) I am not sure what you mean by 800%, but it is definitely
> legitimate to want to know where you are computing.
>
> 2) At least some computation is happening on the GPU. I can tell this from the
>     Vec/MatCopyToGPU events.
>
> 3) Your flop rates are not great. The MatMult is about half what we get on the
>     Tesla, but you could have another card without good support for double
>     precision. The vector ops, however, are pretty bad.
>
> 4) It looks like half the flops are in MatMult, which is definitely on the card,
>     and the others are in vector operations. Do you create any other vectors
>     without the CUSP type?
>
>    Matt
>
> > thanks again,
> > df
> >
> > On Sat, Nov 17, 2012 at 1:49 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> >>
> >> Please read the large boxed message about debugging mode.
> >>
> >> (Replying from phone so can't make it 72 point blinking red, sorry.)
> >>
> >> On Nov 17, 2012 1:41 PM, "David Fuentes" <fuentesdt at gmail.com> wrote:
> >>>
> >>> thanks Matt,
> >>>
> >>> My log summary is below.
> >>>
> >>>
> >>>
> ************************************************************************************************************************
> >>> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r
> >>> -fCourier9' to print this document            ***
> >>>
> >>>
> ************************************************************************************************************************
> >>>
> >>> ---------------------------------------------- PETSc Performance
> Summary:
> >>> ----------------------------------------------
> >>>
> >>> ./FocusUltraSoundModel on a gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg named
> >>> SCRGP2 with 1 processor, by fuentes Sat Nov 17 13:35:06 2012
> >>> Using Petsc Release Version 3.3.0, Patch 4, Fri Oct 26 10:46:51 CDT
> 2012
> >>>
> >>>                          Max       Max/Min        Avg      Total
> >>> Time (sec):           3.164e+01      1.00000   3.164e+01
> >>> Objects:              4.100e+01      1.00000   4.100e+01
> >>> Flops:                2.561e+09      1.00000   2.561e+09  2.561e+09
> >>> Flops/sec:            8.097e+07      1.00000   8.097e+07  8.097e+07
> >>> Memory:               2.129e+08      1.00000              2.129e+08
> >>> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
> >>> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
> >>> MPI Reductions:       4.230e+02      1.00000
> >>>
> >>> Flop counting convention: 1 flop = 1 real number operation of type
> >>> (multiply/divide/add/subtract)
> >>>                             e.g., VecAXPY() for real vectors of length
> N
> >>> --> 2N flops
> >>>                             and VecAXPY() for complex vectors of
> length N
> >>> --> 8N flops
> >>>
> >>> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
> >>>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
> >>>  0:      Main Stage: 3.1636e+01 100.0%  2.5615e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  4.220e+02  99.8%
> >>>
> >>>
> >>>
> ------------------------------------------------------------------------------------------------------------------------
> >>> See the 'Profiling' chapter of the users' manual for details on
> >>> interpreting output.
> >>> Phase summary info:
> >>>    Count: number of times phase was executed
> >>>    Time and Flops: Max - maximum over all processors
> >>>                    Ratio - ratio of maximum to minimum over all
> >>> processors
> >>>    Mess: number of messages sent
> >>>    Avg. len: average message length
> >>>    Reduct: number of global reductions
> >>>    Global: entire computation
> >>>    Stage: stages of a computation. Set stages with PetscLogStagePush()
> >>> and PetscLogStagePop().
> >>>       %T - percent time in this phase         %f - percent flops in
> this
> >>> phase
> >>>       %M - percent messages in this phase     %L - percent message
> >>> lengths in this phase
> >>>       %R - percent reductions in this phase
> >>>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time
> >>> over all processors)
> >>>
> >>>
> ------------------------------------------------------------------------------------------------------------------------
> >>>
> >>>
> >>>       ##########################################################
> >>>       #                                                        #
> >>>       #                          WARNING!!!                    #
> >>>       #                                                        #
> >>>       #   This code was compiled with a debugging option,      #
> >>>       #   To get timing results run ./configure                #
> >>>       #   using --with-debugging=no, the performance will      #
> >>>       #   be generally two or three times faster.              #
> >>>       #                                                        #
> >>>       ##########################################################
> >>>
> >>>
> >>> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
> >>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
> >>> ------------------------------------------------------------------------------------------------------------------------
> >>>
> >>> --- Event Stage 0: Main Stage
> >>>
> >>> ComputeFunction       52 1.0 3.9104e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  1  0  0  0  1   1  0  0  0  1     0
> >>> VecDot                50 1.0 3.2072e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  3025
> >>> VecMDot               50 1.0 1.3100e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   741
> >>> VecNorm              200 1.0 9.7943e-02 1.0 3.88e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 15  0  0  0   0 15  0  0  0  3963
> >>> VecScale             100 1.0 1.3496e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   719
> >>> VecCopy              150 1.0 4.8405e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> >>> VecSet               164 1.0 2.9707e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> >>> VecAXPY               50 1.0 3.2194e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  3014
> >>> VecWAXPY              50 1.0 2.9040e-01 1.0 4.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0   167
> >>> VecMAXPY             100 1.0 5.4555e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  8  0  0  0   2  8  0  0  0   356
> >>> VecPointwiseMult     100 1.0 5.3003e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  2  4  0  0  0   2  4  0  0  0   183
> >>> VecScatterBegin       53 1.0 1.8660e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> >>> VecReduceArith       101 1.0 6.9973e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0  2801
> >>> VecReduceComm         51 1.0 1.0252e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> >>> VecNormalize         100 1.0 1.8565e-01 1.0 2.91e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   1 11  0  0  0  1568
> >>> VecCUSPCopyTo        152 1.0 5.8016e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> >>> VecCUSPCopyFrom      201 1.0 6.0029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> >>> MatMult              100 1.0 6.8465e-01 1.0 1.25e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2 49  0  0  0   2 49  0  0  0  1825
> >>> MatAssemblyBegin       3 1.0 3.3379e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> >>> MatAssemblyEnd         3 1.0 2.7767e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> >>> MatZeroEntries         1 1.0 2.0346e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> >>> MatCUSPCopyTo          3 1.0 1.4056e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> >>> SNESSolve              1 1.0 2.2094e+01 1.0 2.56e+09 1.0 0.0e+00 0.0e+00 3.7e+02 70100  0  0 88  70100  0  0 89   116
> >>> SNESFunctionEval      51 1.0 3.9031e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> >>> SNESJacobianEval      50 1.0 1.3191e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
> >>> SNESLineSearch        50 1.0 6.2922e+00 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 5.0e+01 20 45  0  0 12  20 45  0  0 12   184
> >>> KSPGMRESOrthog        50 1.0 4.0436e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 5.0e+01  1  8  0  0 12   1  8  0  0 12   480
> >>> KSPSetUp              50 1.0 2.1935e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+01  0  0  0  0  4   0  0  0  0  4     0
> >>> KSPSolve              50 1.0 1.3230e+01 1.0 1.40e+09 1.0 0.0e+00 0.0e+00 3.2e+02 42 55  0  0 75  42 55  0  0 75   106
> >>> PCSetUp               50 1.0 1.9897e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.9e+01  6  0  0  0 12   6  0  0  0 12     0
> >>> PCApply              100 1.0 5.7457e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 4.0e+00  2  4  0  0  1   2  4  0  0  1   169
> >>>
> >>>
> ------------------------------------------------------------------------------------------------------------------------
> >>>
> >>> Memory usage is given in bytes:
> >>>
> >>> Object Type          Creations   Destructions     Memory  Descendants'
> >>> Mem.
> >>> Reports information only for process 0.
> >>>
> >>> --- Event Stage 0: Main Stage
> >>>
> >>>            Container     2              2         1096     0
> >>>               Vector    16             16    108696592     0
> >>>       Vector Scatter     2              2         1240     0
> >>>               Matrix     1              1     96326824     0
> >>>     Distributed Mesh     3              3      7775936     0
> >>>      Bipartite Graph     6              6         4104     0
> >>>            Index Set     5              5      3884908     0
> >>>    IS L to G Mapping     1              1      3881760     0
> >>>                 SNES     1              1         1268     0
> >>>       SNESLineSearch     1              1          840     0
> >>>               Viewer     1              0            0     0
> >>>        Krylov Solver     1              1        18288     0
> >>>       Preconditioner     1              1          792     0
> >>>
> >>>
> ========================================================================================================================
> >>> Average time to get PetscTime(): 9.53674e-08
> >>> #PETSc Option Table entries:
> >>> -da_vec_type cusp
> >>> -dm_mat_type seqaijcusp
> >>> -ksp_monitor
> >>> -log_summary
> >>> -pc_type jacobi
> >>> -snes_converged_reason
> >>> -snes_monitor
> >>> #End of PETSc Option Table entries
> >>> Compiled without FORTRAN kernels
> >>> Compiled with full precision matrices (default)
> >>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> >>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> >>> Configure run at: Fri Nov 16 08:40:52 2012
> >>> Configure options: --with-clanguage=C++ --with-mpi-dir=/usr
> >>> --with-shared-libraries --with-cuda-arch=sm_20 --CFLAGS=-O0
> --CXXFLAGS=-O0
> >>> --CUDAFLAGS=-O0 --with-etags=1 --with-mpi4py=0
> >>>
> --with-blas-lapack-lib="[/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_rt.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_intel_thread.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_core.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libiomp5.so]"
> >>> --download-blacs --download-superlu_dist --download-triangle
> >>> --download-parmetis --download-metis --download-mumps
> --download-scalapack
> >>> --with-cuda=1 --with-cusp=1 --with-thrust=1
> >>> --with-cuda-dir=/opt/apps/cuda/4.2//cuda --with-sieve=1
> >>> --download-exodusii=yes --download-netcdf --with-boost=1
> >>> --with-boost-dir=/usr --download-fiat=yes --download-generator
> >>> --download-scientificpython --with-matlab=1 --with-matlab-engine=1
> >>> --with-matlab-dir=/opt/MATLAB/R2011a
> >>> -----------------------------------------
> >>> Libraries compiled on Fri Nov 16 08:40:52 2012 on SCRGP2
> >>> Machine characteristics:
> >>> Linux-2.6.32-41-server-x86_64-with-debian-squeeze-sid
> >>> Using PETSc directory: /opt/apps/PETSC/petsc-3.3-p4
> >>> Using PETSc arch: gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg
> >>> -----------------------------------------
> >>>
> >>> Using C compiler: /usr/bin/mpicxx -O0 -g   -fPIC   ${COPTFLAGS}
> ${CFLAGS}
> >>> Using Fortran compiler: /usr/bin/mpif90  -fPIC -Wall
> -Wno-unused-variable
> >>> -g   ${FOPTFLAGS} ${FFLAGS}
> >>> -----------------------------------------
> >>>
> >>> Using include paths:
> >>>
> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include
> >>> -I/opt/apps/PETSC/petsc-3.3-p4/include
> >>> -I/opt/apps/PETSC/petsc-3.3-p4/include
> >>>
> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include
> >>> -I/opt/apps/cuda/4.2//cuda/include
> >>> -I/opt/apps/PETSC/petsc-3.3-p4/include/sieve
> >>> -I/opt/MATLAB/R2011a/extern/include -I/usr/include
> >>>
> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/cbind/include
> >>>
> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/forbind/include
> >>> -I/usr/include/mpich2
> >>> -----------------------------------------
> >>>
> >>> Using C linker: /usr/bin/mpicxx
> >>> Using Fortran linker: /usr/bin/mpif90
> >>> Using libraries:
> >>>
> -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
> >>> -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
> >>> -lpetsc
> >>>
> -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
> >>> -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
> >>> -ltriangle -lX11 -lpthread -lsuperlu_dist_3.1 -lcmumps -ldmumps
> -lsmumps
> >>> -lzmumps -lmumps_common -lpord -lparmetis -lmetis -lscalapack -lblacs
> >>> -Wl,-rpath,/opt/apps/cuda/4.2//cuda/lib64
> -L/opt/apps/cuda/4.2//cuda/lib64
> >>> -lcufft -lcublas -lcudart -lcusparse
> >>>
> -Wl,-rpath,/opt/MATLAB/R2011a/sys/os/glnxa64:/opt/MATLAB/R2011a/bin/glnxa64:/opt/MATLAB/R2011a/extern/lib/glnxa64
> >>> -L/opt/MATLAB/R2011a/bin/glnxa64
> -L/opt/MATLAB/R2011a/extern/lib/glnxa64
> >>> -leng -lmex -lmx -lmat -lut -licudata -licui18n -licuuc
> >>> -Wl,-rpath,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib
> >>> -L/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib -lmkl_rt -lmkl_intel_thread
> >>> -lmkl_core -liomp5 -lexoIIv2for -lexodus -lnetcdf_c++ -lnetcdf
> >>> -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.4.3
> >>> -L/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -lmpichf90 -lgfortran -lm -lm
> >>> -lmpichcxx -lstdc++ -lmpichcxx -lstdc++ -ldl -lmpich -lopa -lpthread
> -lrt
> >>> -lgcc_s -ldl
> >>> -----------------------------------------
> >>>
> >>>
> >>>
> >>> On Sat, Nov 17, 2012 at 11:02 AM, Matthew Knepley <knepley at gmail.com>
> >>> wrote:
> >>>>
> >>>> On Sat, Nov 17, 2012 at 10:50 AM, David Fuentes <fuentesdt at gmail.com>
> >>>> wrote:
> >>>> > Hi,
> >>>> >
> >>>> > I'm using PETSc 3.3-p4.
> >>>> > I'm trying to run a nonlinear SNES solver on the GPU with GMRES and a
> >>>> > Jacobi PC, using the VECSEQCUSP and MATSEQAIJCUSP datatypes for the RHS
> >>>> > and Jacobian matrix, respectively.
> >>>> > When running top I still see significant CPU utilization (800-900 %CPU)
> >>>> > during the solve, possibly from some multithreaded operations?
> >>>> >
> >>>> > Is this expected?
> >>>> > I was thinking that since I pass everything into the solver as a CUSP
> >>>> > datatype, all linear algebra operations would run on the GPU device from
> >>>> > there, and I wasn't expecting to see such CPU utilization during the solve.
> >>>> > Do I perhaps have an error in my code somewhere?
> >>>>
> >>>> We cannot answer performance questions without -log_summary
> >>>>
> >>>>    Matt
> >>>>
> >>>> > Thanks,
> >>>> > David
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> What most experimenters take for granted before they begin their
> >>>> experiments is infinitely more interesting than any results to which
> >>>> their experiments lead.
> >>>> -- Norbert Wiener
> >>>
> >>>
> >
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener
>