gprof or some similar tool?
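[Editor's sketch] For startup cost that is suspected to live in the dynamic loader rather than in user code, one "similar tool" is glibc's ld.so itself: with LD_DEBUG=statistics it reports how much of process startup went into loading and relocating shared objects. A hedged sketch follows; /bin/true stands in for the real executables (ex_simple / ex_simple_slow on Summit), and the report is written to stderr.

```shell
# Ask glibc's dynamic loader to report its own startup costs.
# /bin/true is a stand-in for the real binary; the statistics go to stderr.
LD_DEBUG=statistics /bin/true 2> ld_stats.txt

# Look for the loader's timing summary, e.g. "runtime linker statistics:"
# and "total startup time in dynamic loader".
grep "statistics" ld_stats.txt

rm -f ld_stats.txt
```

Running this on both linkings and comparing the loader totals would show directly whether the 7.5 seconds is spent inside ld.so.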
> On Feb 10, 2020, at 11:18 AM, Zhang, Hong via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>
> -cuda_initialize 0 does not make any difference. Actually this issue has nothing to do with PetscInitialize(). I tried to call cudaFree(0) before PetscInitialize(), and it still took 7.5 seconds.
>
> Hong
>
>> On Feb 10, 2020, at 10:44 AM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>>
>> As I mentioned, have you tried -cuda_initialize 0? Also, PetscCUDAInitialize contains
>>   ierr = PetscCUBLASInitializeHandle();CHKERRQ(ierr);
>>   ierr = PetscCUSOLVERDnInitializeHandle();CHKERRQ(ierr);
>> Have you tried commenting these out and testing again?
>> --Junchao Zhang
>>
>> On Sat, Feb 8, 2020 at 5:22 PM Zhang, Hong via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>>
>>> On Feb 8, 2020, at 5:03 PM, Matthew Knepley <knep...@gmail.com> wrote:
>>>
>>> On Sat, Feb 8, 2020 at 4:34 PM Zhang, Hong via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>>> I did some further investigation. The overhead persists for both the PETSc shared library and the static library. The previous example calls no PETSc function, yet its first CUDA call becomes very slow when the executable is linked against the PETSc shared library. This indicates that the slowdown occurs when the symbol (cudaFree) is resolved through the PETSc .so, but not when the symbol is found directly in the CUDA runtime library.
>>>
>>> So the issue has nothing to do with the dynamic linker. The following example can be used to easily reproduce the problem (cudaFree(0) always takes ~7.5 seconds).
>>>
>>> 1) This should go to OLCF admin as Jeff suggests
>>
>> I had sent this to OLCF admin before the discussion was started here. Thomas Papatheodore has followed up. I am trying to help him reproduce the problem on summit.
>>
>>> 2) Just to make sure I understand: a static executable with this code is still slow on the cudaFree(), since CUDA is a shared library by default.
>>
>> I prepared the code as a minimal example to reproduce the problem. It would be fair to say that any code using PETSc (with CUDA enabled, built statically or dynamically) on summit suffers a 7.5-second overhead on the first CUDA function call (either in the user code or inside PETSc).
>>
>> Thanks,
>> Hong
>>
>>> I think we should try:
>>>
>>> a) Forcing a full static link, if possible
>>>
>>> b) Asking OLCF about link resolution order
>>>
>>> It sounds like a similar thing I have seen in the past, where link resolution order can exponentially increase load time.
>>>
>>> Thanks,
>>>
>>>    Matt
>>>
>>> bash-4.2$ cat ex_simple_petsc.c
>>> #include <time.h>
>>> #include <cuda_runtime.h>
>>> #include <stdio.h>
>>> #include <petscmat.h>
>>>
>>> int main(int argc,char **args)
>>> {
>>>   clock_t start,s1,s2,s3;
>>>   double *init,tmp[100] = {0};
>>>   PetscErrorCode ierr = 0;
>>>
>>>   ierr = PetscInitialize(&argc,&args,(char*)0,NULL);if (ierr) return ierr;
>>>   start = clock();
>>>   cudaFree(0);
>>>   s1 = clock();
>>>   cudaMalloc((void **)&init,100*sizeof(double));
>>>   s2 = clock();
>>>   cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>>>   s3 = clock();
>>>   printf("free time =%lf malloc time =%lf copy time =%lf\n",
>>>          ((double)(s1 - start))/CLOCKS_PER_SEC,
>>>          ((double)(s2 - s1))/CLOCKS_PER_SEC,
>>>          ((double)(s3 - s2))/CLOCKS_PER_SEC);
>>>   ierr = PetscFinalize();
>>>   return ierr;
>>> }
>>>
>>> Hong
>>>
>>>> On Feb 7, 2020, at 3:09 PM, Zhang, Hong <hongzh...@anl.gov> wrote:
>>>>
>>>> Note that the overhead was triggered by the first call to a CUDA function. So it seems that the first CUDA call triggered loading the PETSc .so (if it is linked), which is slow on the summit file system.
>>>>
>>>> Hong
>>>>
>>>>> On Feb 7, 2020, at 2:54 PM, Zhang, Hong via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>>>>>
>>>>> Linking any other shared library does not slow down the execution.
>>>>> The PETSc shared library is the only one causing trouble.
>>>>>
>>>>> Here are the ldd outputs for the two versions. For the first version, I removed -lpetsc and it ran very fast. The second (slow) version was linked against the PETSc shared library.
>>>>>
>>>>> bash-4.2$ ldd ex_simple
>>>>>     linux-vdso64.so.1 => (0x0000200000050000)
>>>>>     liblapack.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0 (0x0000200000070000)
>>>>>     libblas.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0 (0x00002000009b0000)
>>>>>     libhdf5hl_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100 (0x0000200000e80000)
>>>>>     libhdf5_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100 (0x0000200000ed0000)
>>>>>     libhdf5_hl.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100 (0x0000200000f50000)
>>>>>     libhdf5.so.103 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103 (0x0000200000fb0000)
>>>>>     libX11.so.6 => /usr/lib64/libX11.so.6 (0x00002000015e0000)
>>>>>     libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 (0x0000200001770000)
>>>>>     libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 (0x0000200009b00000)
>>>>>     libcudart.so.10.1 => /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 (0x000020000d950000)
>>>>>     libcusparse.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 (0x000020000d9f0000)
>>>>>     libcusolver.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 (0x0000200012f50000)
>>>>>     libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000020001dc40000)
>>>>>     libdl.so.2 => /usr/lib64/libdl.so.2 (0x000020001ddd0000)
>>>>>     libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x000020001de00000)
>>>>>     libmpiprofilesupport.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3 (0x000020001de40000)
>>>>>     libmpi_ibm_usempi.so => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so (0x000020001de70000)
>>>>>     libmpi_ibm_mpifh.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3 (0x000020001dea0000)
>>>>>     libmpi_ibm.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3 (0x000020001df40000)
>>>>>     libpgf90rtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so (0x000020001e0b0000)
>>>>>     libpgf90.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so (0x000020001e0f0000)
>>>>>     libpgf90_rpm1.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so (0x000020001e6a0000)
>>>>>     libpgf902.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so (0x000020001e6d0000)
>>>>>     libpgftnrtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so (0x000020001e700000)
>>>>>     libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x000020001e730000)
>>>>>     libpgkomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so (0x000020001e760000)
>>>>>     libomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so (0x000020001e790000)
>>>>>     libomptarget.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so (0x000020001e880000)
>>>>>     libpgmath.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so (0x000020001e8b0000)
>>>>>     libpgc.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so (0x000020001e9d0000)
>>>>>     librt.so.1 => /usr/lib64/librt.so.1 (0x000020001eb40000)
>>>>>     libm.so.6 => /usr/lib64/libm.so.6 (0x000020001eb70000)
>>>>>     libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x000020001ec60000)
>>>>>     libc.so.6 => /usr/lib64/libc.so.6 (0x000020001eca0000)
>>>>>     libz.so.1 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1 (0x000020001ee90000)
>>>>>     libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x000020001eef0000)
>>>>>     /lib64/ld64.so.2 (0x0000200000000000)
>>>>>     libcublasLt.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 (0x000020001ef40000)
>>>>>     libutil.so.1 => /usr/lib64/libutil.so.1 (0x0000200020e50000)
>>>>>     libhwloc_ompi.so.15 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15 (0x0000200020e80000)
>>>>>     libevent-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6 (0x0000200020ef0000)
>>>>>     libevent_pthreads-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6 (0x0000200020f70000)
>>>>>     libopen-rte.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3 (0x0000200020fa0000)
>>>>>     libopen-pal.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3 (0x00002000210b0000)
>>>>>     libXau.so.6 => /usr/lib64/libXau.so.6 (0x00002000211a0000)
>>>>>
>>>>> bash-4.2$ ldd ex_simple_slow
>>>>>     linux-vdso64.so.1 => (0x0000200000050000)
>>>>>     libpetsc.so.3.012 => /autofs/nccs-svm1_home1/hongzh/Projects/petsc/arch-olcf-summit-sell-opt/lib/libpetsc.so.3.012 (0x0000200000070000)
>>>>>     liblapack.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0 (0x0000200002be0000)
>>>>>     libblas.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0 (0x0000200003520000)
>>>>>     libhdf5hl_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100 (0x00002000039f0000)
>>>>>     libhdf5_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100 (0x0000200003a40000)
>>>>>     libhdf5_hl.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100 (0x0000200003ac0000)
>>>>>     libhdf5.so.103 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103 (0x0000200003b20000)
>>>>>     libX11.so.6 => /usr/lib64/libX11.so.6 (0x0000200004150000)
>>>>>     libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 (0x00002000042e0000)
>>>>>     libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 (0x000020000c670000)
>>>>>     libcudart.so.10.1 => /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 (0x00002000104c0000)
>>>>>     libcusparse.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 (0x0000200010560000)
>>>>>     libcusolver.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 (0x0000200015ac0000)
>>>>>     libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002000207b0000)
>>>>>     libdl.so.2 => /usr/lib64/libdl.so.2 (0x0000200020940000)
>>>>>     libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x0000200020970000)
>>>>>     libmpiprofilesupport.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3 (0x00002000209b0000)
>>>>>     libmpi_ibm_usempi.so => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so (0x00002000209e0000)
>>>>>     libmpi_ibm_mpifh.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3 (0x0000200020a10000)
>>>>>     libmpi_ibm.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3 (0x0000200020ab0000)
>>>>>     libpgf90rtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so (0x0000200020c20000)
>>>>>     libpgf90.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so (0x0000200020c60000)
>>>>>     libpgf90_rpm1.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so (0x0000200021210000)
>>>>>     libpgf902.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so (0x0000200021240000)
>>>>>     libpgftnrtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so (0x0000200021270000)
>>>>>     libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x00002000212a0000)
>>>>>     libpgkomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so (0x00002000212d0000)
>>>>>     libomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so (0x0000200021300000)
>>>>>     libomptarget.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so (0x00002000213f0000)
>>>>>     libpgmath.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so (0x0000200021420000)
>>>>>     libpgc.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so (0x0000200021540000)
>>>>>     librt.so.1 => /usr/lib64/librt.so.1 (0x00002000216b0000)
>>>>>     libm.so.6 => /usr/lib64/libm.so.6 (0x00002000216e0000)
>>>>>     libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00002000217d0000)
>>>>>     libc.so.6 => /usr/lib64/libc.so.6 (0x0000200021810000)
>>>>>     libz.so.1 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1 (0x0000200021a10000)
>>>>>     libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x0000200021a60000)
>>>>>     /lib64/ld64.so.2 (0x0000200000000000)
>>>>>     libcublasLt.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 (0x0000200021ab0000)
>>>>>     libutil.so.1 => /usr/lib64/libutil.so.1 (0x00002000239c0000)
>>>>>     libhwloc_ompi.so.15 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15 (0x00002000239f0000)
>>>>>     libevent-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6 (0x0000200023a60000)
>>>>>     libevent_pthreads-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6 (0x0000200023ae0000)
>>>>>     libopen-rte.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3 (0x0000200023b10000)
>>>>>     libopen-pal.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3 (0x0000200023c20000)
>>>>>     libXau.so.6 => /usr/lib64/libXau.so.6 (0x0000200023d10000)
>>>>>
>>>>>> On Feb 7, 2020, at 2:31 PM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>>>>>
>>>>>> ldd -o on the executable of both linkings of your code.
>>>>>>
>>>>>> My guess is that without PETSc it is linking the static versions of the needed libraries, and with PETSc the shared versions. And, in typical fashion, the shared libraries are off on some super slow file system, so they take a long time to be loaded and linked in on demand.
>>>>>>
>>>>>> Still a performance bug in Summit.
>>>>>>
>>>>>>    Barry
>>>>>>
>>>>>>> On Feb 7, 2020, at 12:23 PM, Zhang, Hong via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Previously I have noticed that the first call to a CUDA function such as cudaMalloc or cudaFree in PETSc takes a long time (7.5 seconds) on summit. I then prepared the attached simple example to help OLCF reproduce the problem. It turned out that the problem was caused by PETSc: the 7.5-second overhead can be observed only when the PETSc lib is linked. If I do not link PETSc, it runs normally. Does anyone have any idea why this happens and how to fix it?
>>>>>>>
>>>>>>> Hong (Mr.)
>>>>>>>
>>>>>>> bash-4.2$ cat ex_simple.c
>>>>>>> #include <time.h>
>>>>>>> #include <cuda_runtime.h>
>>>>>>> #include <stdio.h>
>>>>>>>
>>>>>>> int main(int argc,char **args)
>>>>>>> {
>>>>>>>   clock_t start,s1,s2,s3;
>>>>>>>   double *init,tmp[100] = {0};
>>>>>>>
>>>>>>>   start = clock();
>>>>>>>   cudaFree(0);
>>>>>>>   s1 = clock();
>>>>>>>   cudaMalloc((void **)&init,100*sizeof(double));
>>>>>>>   s2 = clock();
>>>>>>>   cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>>>>>>>   s3 = clock();
>>>>>>>   printf("free time =%lf malloc time =%lf copy time =%lf\n",
>>>>>>>          ((double)(s1 - start))/CLOCKS_PER_SEC,
>>>>>>>          ((double)(s2 - s1))/CLOCKS_PER_SEC,
>>>>>>>          ((double)(s3 - s2))/CLOCKS_PER_SEC);
>>>>>>>
>>>>>>>   return 0;
>>>>>>> }
>>>
>>> --
>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/