Hi again,

On 01 March 2013 20:06, Jed Brown wrote:
> Matrix and vector operations are probably running in parallel, but probably
> not the operations that are taking time. Always send -log_summary if you
> have a performance question.
I don't think they are running in parallel. When I analyze my code in Intel VTune Amplifier, the only routines running in parallel are my own OpenMP ones. Indeed, if I comment out my OpenMP pragmas and recompile my code, it never uses more than one thread. (A minimal sketch of the kind of OpenMP loop I mean is appended as a P.S. after the log.)

The output of -log_summary is shown below; this run uses -pc_type lu -ksp_type bcgs. The fastest PC for my cases is usually BoomerAMG from HYPRE, but I used LU here in order to limit the test to PETSc only. The summary agrees with VTune that MatLUFactorNumeric is the most time-consuming routine; in general, the preconditioner always seems to be the most time-consuming part. Any advice on how to get OpenMP working?

Regards,
Åsmund

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./run on a arch-linux2-c-opt named vsl161 with 1 processor, by asmunder Wed Mar 6 10:14:55 2013
Using Petsc Development HG revision: 58cc6199509f1642f637843f1ca468283bf5ced9  HG Date: Wed Jan 30 00:39:35 2013 -0600

                         Max       Max/Min        Avg      Total
Time (sec):           4.446e+02      1.00000   4.446e+02
Objects:              2.017e+03      1.00000   2.017e+03
Flops:                3.919e+11      1.00000   3.919e+11  3.919e+11
Flops/sec:            8.815e+08      1.00000   8.815e+08  8.815e+08
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       2.818e+03      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 4.4460e+02 100.0%  3.9191e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.817e+03 100.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase          %f - percent flops in this phase
      %M - percent messages in this phase      %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecDot               802 1.0 9.2811e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2117
VecDotNorm2          401 1.0 7.1333e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 4.0e+02  0  0  0  0 14   0  0  0  0 14  2755
VecNorm             1203 1.0 7.8265e-02 1.0 2.95e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  3766
VecCopy              802 1.0 1.1754e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet              1211 1.0 9.9961e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              401 1.0 4.5847e-02 1.0 9.82e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2143
VecAXPBYCZ           802 1.0 1.3489e-01 1.0 3.93e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2913
VecWAXPY             802 1.0 1.2292e-01 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1599
VecAssemblyBegin     802 1.0 2.4509e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd       802 1.0 6.7234e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatMult             1203 1.0 1.1513e+00 1.0 1.32e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1149
MatSolve            1604 1.0 1.4714e+01 1.0 2.07e+10 1.0 0.0e+00 0.0e+00 0.0e+00  3  5  0  0  0   3  5  0  0  0  1405
MatLUFactorSym       401 1.0 4.0197e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+03  9  0  0  0 43   9  0  0  0 43     0
MatLUFactorNum       401 1.0 2.3728e+02 1.0 3.69e+11 1.0 0.0e+00 0.0e+00 0.0e+00 53 94  0  0  0  53 94  0  0  0  1553
MatAssemblyBegin     401 1.0 1.7977e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd       401 1.0 3.1975e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetRowIJ          401 1.0 9.1545e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering       401 1.0 2.0361e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+02  5  0  0  0 28   5  0  0  0 28     0
KSPSetUp             401 1.0 4.1821e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve             401 1.0 3.1511e+02 1.0 3.92e+11 1.0 0.0e+00 0.0e+00 2.8e+03 71100  0  0100  71100  0  0100  1244
PCSetUp              401 1.0 2.9844e+02 1.0 3.69e+11 1.0 0.0e+00 0.0e+00 2.0e+03 67 94  0  0 71  67 94  0  0 71  1235
PCApply             1604 1.0 1.4717e+01 1.0 2.07e+10 1.0 0.0e+00 0.0e+00 0.0e+00  3  5  0  0  0   3  5  0  0  0  1405
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage

              Vector   409            409    401422048       0
              Matrix   402            402  31321054412       0
       Krylov Solver     1              1         1128       0
      Preconditioner     1              1         1152       0
           Index Set  1203           1203    393903904       0
              Viewer     1              0            0       0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
#PETSc Option Table entries:
-ksp_type bcgs
-log_summary
-pc_type lu
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure run at: Fri Mar 1 12:53:06 2013
Configure options: --with-pthreadclasses --with-openmp --with-debugging=0 --with-shared-libraries=1 --download-mpich --download-hypre --with-boost-dir=/usr COPTFLAGS=-O3 FOPTFLAGS=-O3
-----------------------------------------
Libraries compiled on Fri Mar 1 12:53:06 2013 on vsl161
Machine characteristics: Linux-3.7.9-1-ARCH-x86_64-with-glibc2.2.5
Using PETSc directory: /opt/petsc/petsc-dev-install
Using PETSc arch: arch-linux2-c-opt
-----------------------------------------
Using C compiler: /opt/petsc/petsc-dev-install/arch-linux2-c-opt/bin/mpicc -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O3 -fopenmp ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /opt/petsc/petsc-dev-install/arch-linux2-c-opt/bin/mpif90 -fPIC -Wall -Wno-unused-variable -Wno-unused-dummy-argument -O3 -fopenmp ${FOPTFLAGS} ${FFLAGS}
-----------------------------------------
Using include paths: -I/opt/petsc/petsc-dev-install/arch-linux2-c-opt/include -I/opt/petsc/petsc-dev-install/include -I/opt/petsc/petsc-dev-install/include -I/opt/petsc/petsc-dev-install/arch-linux2-c-opt/include -I/usr/include
-----------------------------------------
Using C linker: /opt/petsc/petsc-dev-install/arch-linux2-c-opt/bin/mpicc
Using Fortran linker: /opt/petsc/petsc-dev-install/arch-linux2-c-opt/bin/mpif90
Using libraries: -Wl,-rpath,/opt/petsc/petsc-dev-install/arch-linux2-c-opt/lib -L/opt/petsc/petsc-dev-install/arch-linux2-c-opt/lib -lpetsc -Wl,-rpath,/opt/petsc/petsc-dev-install/arch-linux2-c-opt/lib -L/opt/petsc/petsc-dev-install/arch-linux2-c-opt/lib -lHYPRE -Wl,-rpath,/usr/lib/gcc/x86_64-unknown-linux-gnu/4.7.2 -L/usr/lib/gcc/x86_64-unknown-linux-gnu/4.7.2 -Wl,-rpath,/opt/intel/composer_xe_2013.1.117/compiler/lib/intel64 -L/opt/intel/composer_xe_2013.1.117/compiler/lib/intel64 -Wl,-rpath,/opt/intel/composer_xe_2013.1.117/ipp/lib/intel64 -L/opt/intel/composer_xe_2013.1.117/ipp/lib/intel64 -Wl,-rpath,/opt/intel/composer_xe_2013.1.117/mkl/lib/intel64 -L/opt/intel/composer_xe_2013.1.117/mkl/lib/intel64 -Wl,-rpath,/opt/intel/composer_xe_2013.1.117/tbb/lib/intel64 -L/opt/intel/composer_xe_2013.1.117/tbb/lib/intel64 -lmpichcxx -lstdc++ -llapack -lblas -lX11 -lpthread -lm -lmpichf90 -lgfortran -lm -lgfortran -lm -lquadmath -lm -lmpichcxx -lstdc++ -ldl -lmpich -lopa -lmpl -lrt -lgcc_s -ldl
-----------------------------------------
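
P.S. For reference, the OpenMP regions I mentioned in my own code are plain loop-level pragmas. A minimal, self-contained sketch along those lines (hypothetical names and sizes, not my actual routines) would be:

/* Hypothetical sketch of a loop-level OpenMP region of the kind referred
 * to above; not taken from the actual application code. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double u[N], rhs[N];   /* zero-initialized static arrays */

int main(void)
{
  int    i;
  double dt = 1.0e-3;

  /* Loops like this one do show up as multi-threaded in VTune when the
   * code is built with -fopenmp. */
#pragma omp parallel for
  for (i = 0; i < N; i++) u[i] += dt*rhs[i];

  printf("OpenMP reports up to %d threads\n", omp_get_max_threads());
  return 0;
}

Built with the same mpicc and -fopenmp flags shown in the configure output above, loops of this kind use all available cores, whereas the PETSc routines in the log stay on a single thread.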