Thanks Matt,

My log summary is below.

***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***

---------------------------------------------- PETSc Performance Summary:

./FocusUltraSoundModel on a gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg named SCRGP2 with 1 processor, by fuentes Sat Nov 17 13:35:06 2012
Using Petsc Release Version 3.3.0, Patch 4, Fri Oct 26 10:46:51 CDT 2012

                         Max       Max/Min        Avg      Total
Time (sec):           3.164e+01      1.00000   3.164e+01
Objects:              4.100e+01      1.00000   4.100e+01
Flops:                2.561e+09      1.00000   2.561e+09  2.561e+09
Flops/sec:            8.097e+07      1.00000   8.097e+07  8.097e+07
Memory:               2.129e+08      1.00000              2.129e+08
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       4.230e+02      1.00000

Flop counting convention: 1 flop = 1 real number operation of type
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 3.1636e+01 100.0%  2.5615e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  4.220e+02  99.8%

See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and
      %T - percent time in this phase         %f - percent flops in this
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)

      #                                                        #
      #                          WARNING!!!                    #
      #                                                        #
      #   This code was compiled with a debugging option,      #
      #   To get timing results run ./configure                #
      #   using --with-debugging=no, the performance will      #
      #   be generally two or three times faster.              #
      #                                                        #

Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s

--- Event Stage 0: Main Stage

ComputeFunction       52 1.0 3.9104e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  1  0  0  0  1   1  0  0  0  1     0
VecDot                50 1.0 3.2072e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  3025
VecMDot               50 1.0 1.3100e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   741
VecNorm              200 1.0 9.7943e-02 1.0 3.88e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 15  0  0  0   0 15  0  0  0  3963
VecScale             100 1.0 1.3496e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   719
VecCopy              150 1.0 4.8405e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
VecSet               164 1.0 2.9707e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecAXPY               50 1.0 3.2194e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  3014
VecWAXPY              50 1.0 2.9040e-01 1.0 4.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0   167
VecMAXPY             100 1.0 5.4555e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  8  0  0  0   2  8  0  0  0   356
VecPointwiseMult     100 1.0 5.3003e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  2  4  0  0  0   2  4  0  0  0   183
VecScatterBegin       53 1.0 1.8660e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecReduceArith       101 1.0 6.9973e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0  2801
VecReduceComm         51 1.0 1.0252e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecNormalize         100 1.0 1.8565e-01 1.0 2.91e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   1 11  0  0  0  1568
VecCUSPCopyTo        152 1.0 5.8016e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
VecCUSPCopyFrom      201 1.0 6.0029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
MatMult              100 1.0 6.8465e-01 1.0 1.25e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2 49  0  0  0   2 49  0  0  0  1825
MatAssemblyBegin       3 1.0 3.3379e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         3 1.0 2.7767e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
MatZeroEntries         1 1.0 2.0346e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatCUSPCopyTo          3 1.0 1.4056e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SNESSolve              1 1.0 2.2094e+01 1.0 2.56e+09 1.0 0.0e+00 0.0e+00 3.7e+02 70100  0  0 88  70100  0  0 89   116
SNESFunctionEval      51 1.0 3.9031e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
SNESJacobianEval      50 1.0 1.3191e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
SNESLineSearch        50 1.0 6.2922e+00 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 5.0e+01 20 45  0  0 12  20 45  0  0 12   184
KSPGMRESOrthog        50 1.0 4.0436e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 5.0e+01  1  8  0  0 12   1  8  0  0 12   480
KSPSetUp              50 1.0 2.1935e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+01  0  0  0  0  4   0  0  0  0  4     0
KSPSolve              50 1.0 1.3230e+01 1.0 1.40e+09 1.0 0.0e+00 0.0e+00 3.2e+02 42 55  0  0 75  42 55  0  0 75   106
PCSetUp               50 1.0 1.9897e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.9e+01  6  0  0  0 12   6  0  0  0 12     0
PCApply              100 1.0 5.7457e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 4.0e+00  2  4  0  0  1   2  4  0  0  1   169

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

           Container     2              2         1096     0
              Vector    16             16    108696592     0
      Vector Scatter     2              2         1240     0
              Matrix     1              1     96326824     0
    Distributed Mesh     3              3      7775936     0
     Bipartite Graph     6              6         4104     0
           Index Set     5              5      3884908     0
   IS L to G Mapping     1              1      3881760     0
                SNES     1              1         1268     0
      SNESLineSearch     1              1          840     0
              Viewer     1              0            0     0
       Krylov Solver     1              1        18288     0
      Preconditioner     1              1          792     0
Average time to get PetscTime(): 9.53674e-08
#PETSc Option Table entries:
-da_vec_type cusp
-dm_mat_type seqaijcusp
-pc_type jacobi
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure run at: Fri Nov 16 08:40:52 2012
Configure options: --with-clanguage=C++ --with-mpi-dir=/usr
--with-shared-libraries --with-cuda-arch=sm_20 --CFLAGS=-O0 --CXXFLAGS=-O0
--CUDAFLAGS=-O0 --with-etags=1 --with-mpi4py=0
--download-blacs --download-superlu_dist --download-triangle
--download-parmetis --download-metis --download-mumps --download-scalapack
--with-cuda=1 --with-cusp=1 --with-thrust=1
--with-cuda-dir=/opt/apps/cuda/4.2//cuda --with-sieve=1
--download-exodusii=yes --download-netcdf --with-boost=1
--with-boost-dir=/usr --download-fiat=yes --download-generator
--download-scientificpython --with-matlab=1 --with-matlab-engine=1
Libraries compiled on Fri Nov 16 08:40:52 2012 on SCRGP2
Machine characteristics:
Using PETSc directory: /opt/apps/PETSC/petsc-3.3-p4
Using PETSc arch: gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg

Using C compiler: /usr/bin/mpicxx -O0 -g   -fPIC   ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /usr/bin/mpif90  -fPIC -Wall -Wno-unused-variable

Using include paths:
-I/opt/MATLAB/R2011a/extern/include -I/usr/include

Using C linker: /usr/bin/mpicxx
Using Fortran linker: /usr/bin/mpif90
Using libraries:
-ltriangle -lX11 -lpthread -lsuperlu_dist_3.1 -lcmumps -ldmumps -lsmumps
-lzmumps -lmumps_common -lpord -lparmetis -lmetis -lscalapack -lblacs
-Wl,-rpath,/opt/apps/cuda/4.2//cuda/lib64 -L/opt/apps/cuda/4.2//cuda/lib64
-lcufft -lcublas -lcudart -lcusparse
-L/opt/MATLAB/R2011a/bin/glnxa64 -L/opt/MATLAB/R2011a/extern/lib/glnxa64
-leng -lmex -lmx -lmat -lut -licudata -licui18n -licuuc
-L/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib -lmkl_rt -lmkl_intel_thread
-lmkl_core -liomp5 -lexoIIv2for -lexodus -lnetcdf_c++ -lnetcdf
-L/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -lmpichf90 -lgfortran -lm -lm
-lmpichcxx -lstdc++ -lmpichcxx -lstdc++ -ldl -lmpich -lopa -lpthread -lrt
-lgcc_s -ldl
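
To make the event table above easier to read, here is a small sketch (not part of PETSc) that recomputes the Mflop/s and %T columns for a few rows. The numbers are copied by hand from the log; the dictionary layout and the helper names `mflops` and `percent_time` are my own, chosen only for illustration. Note that the legend prints the scale factor as "10e-6", but the tabulated rates match a factor of 1e-6 (flops per second divided by one million), which is what the sketch assumes.

```python
# Values copied by hand from the -log_summary event table (Main Stage, 1 process).
# name: (max time in seconds, max flops)
events = {
    "VecDot":          (3.2072e-02, 9.70e+07),
    "MatMult":         (6.8465e-01, 1.25e+09),
    "KSPSolve":        (1.3230e+01, 1.40e+09),
    "VecCUSPCopyTo":   (5.8016e-01, 0.0),
    "VecCUSPCopyFrom": (6.0029e-01, 0.0),
}

# "Time (sec)" from the header of the summary.
total_time = 3.164e+01


def mflops(name):
    """Rate in Mflop/s: flops / time, scaled by 1e-6 (assumed scale, see note above)."""
    seconds, flops = events[name]
    return 1e-6 * flops / seconds


def percent_time(name):
    """Share of the total run time spent in this event (the %T column)."""
    seconds, _ = events[name]
    return 100.0 * seconds / total_time


for name in events:
    print(f"{name:16s} {mflops(name):8.0f} Mflop/s  {percent_time(name):5.1f} %T")
```

The recomputed rates differ from the table by about one unit in the last digit at most, because the flop counts in the log are printed with only three significant digits.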

On Sat, Nov 17, 2012 at 11:02 AM, Matthew Knepley <knepley at> wrote:

> On Sat, Nov 17, 2012 at 10:50 AM, David Fuentes <fuentesdt at> wrote:
> > Hi,
> >
> > I'm using PETSc 3.3-p4.
> > I'm trying to run a nonlinear SNES solver on the GPU with GMRES and a Jacobi PC,
> > using VECSEQCUSP and MATSEQAIJCUSP datatypes for the rhs and Jacobian matrix,
> > respectively.
> > When running top I still see significant CPU utilization (800-900 %CPU)
> > during the solve, possibly from some multithreaded operations.
> >
> > Is this expected?
> > I was thinking that since I input everything into the solver as a CUSP
> > datatype, all linear algebra operations would be on the GPU device from
> > there, and I wasn't expecting to see such CPU utilization during the solve.
> > Do I perhaps have an error in my code somewhere?
> We cannot answer performance questions without -log_summary
>    Matt
> > Thanks,
> > David
