> On Feb 13, 2020, at 5:39 PM, Zhang, Hong <hongzh...@anl.gov> wrote:
> 
> 
> 
>> On Feb 13, 2020, at 7:39 AM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>> 
>> 
>> How are the two being compiled and linked? The same way, one with the PETSc 
>> library in the path and the other without? Or does the PETSc one have lots 
>> of flags and stuff while the non-PETSc one is just simple by hand?
> 
> PETSc was built as a static library. Then both examples were linked against 
> that static library.

  Understood. I meant the exact link lines for all.


> 
> Hong
> 
> 
>> 
>> Barry
>> 
>> 
>>> On Feb 12, 2020, at 7:29 PM, Zhang, Hong <hongzh...@anl.gov> wrote:
>>> 
>>> 
>>> 
>>>> On Feb 12, 2020, at 5:11 PM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>>> 
>>>> 
>>>> ldd -o on the petsc program (static) and the non petsc program (static), 
>>>> what are the differences?
>>> 
>>> There is no difference in the outputs.
>>> 
>>>> 
>>>> nm -o both executables | grep cudaFree()
>>> 
>>> Non petsc program:
>>> 
>>> [hongzh@login3.summit tests]$ nm ex_simple | grep cudaFree
>>> 0000000010000ae0 t 00000017.plt_call.cudaFree@@libcudart.so.10.1
>>>             U cudaFree@@libcudart.so.10.1
>>> 
>>> Petsc program:
>>> 
>>> [hongzh@login3.summit tests]$ nm ex_simple_petsc | grep cudaFree
>>> 0000000010016550 t 00000017.plt_call.cudaFree@@libcudart.so.10.1
>>> 0000000010017010 t 00000017.plt_call.cudaFreeHost@@libcudart.so.10.1
>>> 00000000124c3f48 V 
>>> _ZGVZN6thrust2mr19get_global_resourceINS_26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEEEPT_vE8resource
>>> 00000000124c3f50 V 
>>> _ZGVZN6thrust2mr19get_global_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEPT_vE8resource
>>> 0000000010726788 W 
>>> _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEE11do_allocateEmm
>>> 00000000107267e8 W 
>>> _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEE13do_deallocateENS_10device_ptrIvEEmm
>>> 0000000010726878 W 
>>> _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEED0Ev
>>> 0000000010726848 W 
>>> _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEED1Ev
>>> 0000000010729f78 W 
>>> _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEE11do_allocateEmm
>>> 000000001072a218 W 
>>> _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEE13do_deallocateES6_mm
>>> 000000001072a388 W 
>>> _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEED0Ev
>>> 000000001072a358 W 
>>> _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEED1Ev
>>> 0000000012122300 V 
>>> _ZTIN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEE
>>> 0000000012122370 V 
>>> _ZTIN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEE
>>> 0000000012122410 V 
>>> _ZTSN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEE
>>> 00000000121225f0 V 
>>> _ZTSN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEE
>>> 0000000012120630 V 
>>> _ZTVN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEE
>>> 00000000121205b0 V 
>>> _ZTVN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEE
>>> 00000000124c3f30 V 
>>> _ZZN6thrust2mr19get_global_resourceINS_26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEEEPT_vE8resource
>>> 00000000124c3f20 V 
>>> _ZZN6thrust2mr19get_global_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEPT_vE8resource
>>>             U cudaFree@@libcudart.so.10.1
>>>             U cudaFreeHost@@libcudart.so.10.1
>>> 
>>> Hong
>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Feb 12, 2020, at 1:51 PM, Munson, Todd via petsc-dev 
>>>>> <petsc-dev@mcs.anl.gov> wrote:
>>>>> 
>>>>> 
>>>>> There are some side effects when loading shared libraries, such as the 
>>>>> initialization of static variables. Is something like that happening?
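
  To illustrate what a load-time side effect can look like: any constructor in a library
that the executable links against runs when the library is loaded, before main(), so slow
work there shows up as application startup cost. A minimal sketch of that mechanism
(purely illustrative, not taken from PETSc or CUDA):

/* ctor_sketch.c -- hypothetical example: a function marked with the GCC/Clang
   constructor attribute runs when the containing library (or executable) is
   loaded, before main(). Any expensive work here is charged to startup. */
#include <stdio.h>

__attribute__((constructor))
static void library_init_side_effect(void)
{
  /* imagine expensive initialization of static state here */
  fprintf(stderr, "constructor ran before main\n");
}

int main(void)
{
  printf("main starts after all constructors have run\n");
  return 0;
}
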
>>>>> 
>>>>> Another place is the initial runtime library that gets linked (libcrt0 
>>>>> maybe?).  I 
>>>>> think some MPI compilers insert their own version.
>>>>> 
>>>>> Todd.
>>>>> 
>>>>>> On Feb 12, 2020, at 11:38 AM, Zhang, Hong via petsc-dev 
>>>>>> <petsc-dev@mcs.anl.gov> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Feb 12, 2020, at 11:09 AM, Matthew Knepley <knep...@gmail.com> wrote:
>>>>>>> 
>>>>>>> On Wed, Feb 12, 2020 at 11:06 AM Zhang, Hong via petsc-dev 
>>>>>>> <petsc-dev@mcs.anl.gov> wrote:
>>>>>>> Sorry for the long post. Here are replies I have got from OLCF so far. 
>>>>>>> We still don’t know how to solve the problem.
>>>>>>> 
>>>>>>> One interesting thing that Tom noticed is that PetscInitialize() may have 
>>>>>>> called cudaFree(0) 32 times, as nvprof shows, and those calls all run very 
>>>>>>> fast. They may be triggered by other libraries such as cuBLAS. But when 
>>>>>>> PETSc calls cudaFree() explicitly, it is always very slow.
>>>>>>> 
>>>>>>> It sounds really painful, but I would start removing lines from 
>>>>>>> PetscInitialize() until it runs fast.
>>>>>> 
>>>>>> It may be more painful than it sounds. The problem is not really related 
>>>>>> to PetscInitialize(). The following simple example does not call any 
>>>>>> PETSc function, yet if it is linked against the PETSc shared library, 
>>>>>> cudaFree(0) becomes very slow. CUDA is a black box, so there is not much 
>>>>>> we can debug with this simple example.
>>>>>> 
>>>>>> bash-4.2$ cat ex_simple.c
>>>>>> #include <time.h>
>>>>>> #include <cuda_runtime.h>
>>>>>> #include <stdio.h>
>>>>>> 
>>>>>> int main(int argc,char **args)
>>>>>> {
>>>>>>   clock_t start,s1,s2,s3;
>>>>>>   double  cputime;
>>>>>>   double  *init,tmp[100] = {0};
>>>>>> 
>>>>>>   start = clock();
>>>>>>   cudaFree(0);
>>>>>>   s1 = clock();
>>>>>>   cudaMalloc((void **)&init,100*sizeof(double));
>>>>>>   s2 = clock();
>>>>>>   cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>>>>>>   s3 = clock();
>>>>>>   printf("free time =%lf malloc time =%lf copy time =%lf\n",
>>>>>>          ((double) (s1 - start)) / CLOCKS_PER_SEC,
>>>>>>          ((double) (s2 - s1)) / CLOCKS_PER_SEC,
>>>>>>          ((double) (s3 - s2)) / CLOCKS_PER_SEC);
>>>>>>   return 0;
>>>>>> }
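
  A variant worth trying here (a sketch, not the code that produced the timings in this
thread): the same three calls, but with the CUDA return codes checked and a monotonic
wall-clock timer instead of clock(), to rule out silent failures and to measure elapsed
rather than CPU time.

/* ex_simple_check.c -- sketch only */
#include <time.h>
#include <stdio.h>
#include <cuda_runtime.h>

#define CHECK(call) do { \
    cudaError_t err__ = (call); \
    if (err__ != cudaSuccess) \
      fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(err__)); \
  } while (0)

static double now(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + 1e-9*ts.tv_nsec;
}

int main(void)
{
  double *init, tmp[100] = {0};
  double t0 = now();
  CHECK(cudaFree(0));                 /* forces one-time CUDA context creation */
  double t1 = now();
  CHECK(cudaMalloc((void **)&init, 100*sizeof(double)));
  double t2 = now();
  CHECK(cudaMemcpy(init, tmp, 100*sizeof(double), cudaMemcpyHostToDevice));
  double t3 = now();
  printf("free time =%lf malloc time =%lf copy time =%lf\n", t1-t0, t2-t1, t3-t2);
  return 0;
}
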
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Matt
>>>>>>> 
>>>>>>> Hong
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed Feb 12 09:51:33 2020, tpapathe wrote:
>>>>>>> 
>>>>>>> Something else I noticed from the nvprof output (see my previous post) 
>>>>>>> is
>>>>>>> that the runs with PETSc initialized have 33 calls to cudaFree, whereas 
>>>>>>> the
>>>>>>> non-PETSc versions only have the 1 call to cudaFree. I'm not sure what 
>>>>>>> is
>>>>>>> happening in the PETSc initialize/finalize, but it appears to be doing a
>>>>>>> lot under the hood. You can also see there are many additional CUDA 
>>>>>>> calls
>>>>>>> that are not shown in the profiler output from the non-PETSc runs (e.g.,
>>>>>>> additional cudaMalloc and cudaMemcpy calls, cudaDeviceSychronize, etc.).
>>>>>>> Which other systems have you tested this on? Which CUDA Toolkits and 
>>>>>>> CUDA
>>>>>>> drivers were installed on those systems? Please let me know if there is 
>>>>>>> any
>>>>>>> additional information you can share with me about this.
>>>>>>> 
>>>>>>> -Tom
>>>>>>> On Wed Feb 12 09:25:23 2020, tpapathe wrote:
>>>>>>> 
>>>>>>> Ok. Thanks for the additional info, Hong. I'll ask around to see if any
>>>>>>> local (PETSc or CUDA) experts have experienced this behavior. In the
>>>>>>> meantime, is this impacting your work or something you're just curious
>>>>>>> about? A 5-7 second initialization time is indeed unusual, but is it
>>>>>>> negligible relative to the overall walltime of your jobs, or is it
>>>>>>> somehow affecting your productivity?
>>>>>>> 
>>>>>>> -Tom
>>>>>>> On Tue Feb 11 17:04:25 2020, hongzh...@anl.gov wrote:
>>>>>>> 
>>>>>>> We know it happens with PETSc. But note that the slowdown occurs on 
>>>>>>> the first CUDA function call. In the example I sent to you, if we 
>>>>>>> simply link it against the PETSc shared library and don't call any 
>>>>>>> PETSc function, the slowdown still happens on cudaFree(0). We have 
>>>>>>> never seen this behavior on other GPU systems.
>>>>>>> 
>>>>>>> On Feb 11, 2020, at 3:31 PM, Thomas Papatheodore via RT <h...@nccs.gov> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> Thanks for the update. I have now reproduced the behavior you described 
>>>>>>> with
>>>>>>> PETSc + CUDA using your example code:
>>>>>>> 
>>>>>>> [tpapathe@batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ 
>>>>>>> jsrun -n1
>>>>>>> -a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof
>>>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc
>>>>>>> 
>>>>>>> ==16991== NVPROF is profiling process 16991, command:
>>>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc
>>>>>>> 
>>>>>>> ==16991== Profiling application:
>>>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc
>>>>>>> 
>>>>>>> free time =4.730000 malloc time =0.000000 copy time =0.000000
>>>>>>> 
>>>>>>> ==16991== Profiling result:
>>>>>>> 
>>>>>>> Type Time(%) Time Calls Avg Min Max Name
>>>>>>> 
>>>>>>> GPU activities: 100.00% 9.3760us 6 1.5620us 1.3440us 1.7920us [CUDA memcpy HtoD]
>>>>>>> 
>>>>>>> API calls: 99.78% 5.99333s 33 181.62ms 883ns 4.71976s cudaFree
>>>>>>> 
>>>>>>> 0.11% 6.3603ms 379 16.781us 233ns 693.40us cuDeviceGetAttribute
>>>>>>> 
>>>>>>> 0.07% 4.1453ms 4 1.0363ms 1.0186ms 1.0623ms cuDeviceTotalMem
>>>>>>> 
>>>>>>> 0.02% 1.0046ms 4 251.15us 131.45us 449.32us cuDeviceGetName
>>>>>>> 
>>>>>>> 0.01% 808.21us 16 50.513us 6.7080us 621.54us cudaMalloc
>>>>>>> 
>>>>>>> 0.01% 452.06us 450 1.0040us 830ns 6.4430us cudaFuncSetAttribute
>>>>>>> 
>>>>>>> 0.00% 104.89us 6 17.481us 13.419us 21.338us cudaMemcpy
>>>>>>> 
>>>>>>> 0.00% 102.26us 15 6.8170us 6.1900us 10.072us cudaDeviceSynchronize
>>>>>>> 
>>>>>>> 0.00% 93.635us 80 1.1700us 1.0190us 2.1990us cudaEventCreateWithFlags
>>>>>>> 
>>>>>>> 0.00% 92.168us 83 1.1100us 951ns 2.3550us cudaEventDestroy
>>>>>>> 
>>>>>>> 0.00% 52.277us 74 706ns 592ns 1.5640us cudaDeviceGetAttribute
>>>>>>> 
>>>>>>> 0.00% 34.558us 3 11.519us 9.5410us 15.129us cudaStreamDestroy
>>>>>>> 
>>>>>>> 0.00% 27.778us 3 9.2590us 4.9120us 17.632us cudaStreamCreateWithFlags
>>>>>>> 
>>>>>>> 0.00% 11.955us 1 11.955us 11.955us 11.955us cudaSetDevice
>>>>>>> 
>>>>>>> 0.00% 10.361us 7 1.4800us 809ns 3.6580us cudaGetDevice
>>>>>>> 
>>>>>>> 0.00% 5.4310us 3 1.8100us 1.6420us 1.9980us cudaEventCreate
>>>>>>> 
>>>>>>> 0.00% 3.8040us 6 634ns 391ns 1.5350us cuDeviceGetCount
>>>>>>> 
>>>>>>> 0.00% 3.5350us 1 3.5350us 3.5350us 3.5350us cuDeviceGetPCIBusId
>>>>>>> 
>>>>>>> 0.00% 3.2210us 3 1.0730us 949ns 1.1640us cuInit
>>>>>>> 
>>>>>>> 0.00% 2.6780us 5 535ns 369ns 1.0210us cuDeviceGet
>>>>>>> 
>>>>>>> 0.00% 2.5080us 1 2.5080us 2.5080us 2.5080us cudaSetDeviceFlags
>>>>>>> 
>>>>>>> 0.00% 1.6800us 4 420ns 392ns 488ns cuDeviceGetUuid
>>>>>>> 
>>>>>>> 0.00% 1.5720us 3 524ns 398ns 590ns cuDriverGetVersion
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> If I remove all mention of PETSc from the code, compile manually and 
>>>>>>> run, I get
>>>>>>> the expected behavior:
>>>>>>> 
>>>>>>> [tpapathe@batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ pgc++
>>>>>>> -L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple.c -o ex_simple
>>>>>>> 
>>>>>>> 
>>>>>>> [tpapathe@batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ 
>>>>>>> jsrun -n1
>>>>>>> -a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof
>>>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple
>>>>>>> 
>>>>>>> ==17248== NVPROF is profiling process 17248, command:
>>>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple
>>>>>>> 
>>>>>>> ==17248== Profiling application:
>>>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple
>>>>>>> 
>>>>>>> free time =0.340000 malloc time =0.000000 copy time =0.000000
>>>>>>> 
>>>>>>> ==17248== Profiling result:
>>>>>>> 
>>>>>>> Type Time(%) Time Calls Avg Min Max Name
>>>>>>> 
>>>>>>> GPU activities: 100.00% 1.7600us 1 1.7600us 1.7600us 1.7600us [CUDA memcpy HtoD]
>>>>>>> 
>>>>>>> API calls: 98.56% 231.76ms 1 231.76ms 231.76ms 231.76ms cudaFree
>>>>>>> 
>>>>>>> 0.67% 1.5764ms 97 16.251us 234ns 652.65us cuDeviceGetAttribute
>>>>>>> 
>>>>>>> 0.46% 1.0727ms 1 1.0727ms 1.0727ms 1.0727ms cuDeviceTotalMem
>>>>>>> 
>>>>>>> 0.23% 537.38us 1 537.38us 537.38us 537.38us cudaMalloc
>>>>>>> 
>>>>>>> 0.07% 172.80us 1 172.80us 172.80us 172.80us cuDeviceGetName
>>>>>>> 
>>>>>>> 0.01% 21.648us 1 21.648us 21.648us 21.648us cudaMemcpy
>>>>>>> 
>>>>>>> 0.00% 3.3470us 1 3.3470us 3.3470us 3.3470us cuDeviceGetPCIBusId
>>>>>>> 
>>>>>>> 0.00% 2.5310us 3 843ns 464ns 1.3700us cuDeviceGetCount
>>>>>>> 
>>>>>>> 0.00% 1.7260us 2 863ns 490ns 1.2360us cuDeviceGet
>>>>>>> 
>>>>>>> 0.00% 377ns 1 377ns 377ns 377ns cuDeviceGetUuid
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> I also get the expected behavior if I add an MPI_Init and MPI_Finalize 
>>>>>>> to the
>>>>>>> code instead of PETSc initialization:
>>>>>>> 
>>>>>>> [tpapathe@login1: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ mpicc
>>>>>>> -L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple_mpi.c -o ex_simple_mpi
>>>>>>> 
>>>>>>> 
>>>>>>> [tpapathe@batch1: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ 
>>>>>>> jsrun -n1
>>>>>>> -a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof
>>>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi
>>>>>>> 
>>>>>>> ==35166== NVPROF is profiling process 35166, command:
>>>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi
>>>>>>> 
>>>>>>> ==35166== Profiling application:
>>>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi
>>>>>>> 
>>>>>>> free time =0.340000 malloc time =0.000000 copy time =0.000000
>>>>>>> 
>>>>>>> ==35166== Profiling result:
>>>>>>> 
>>>>>>> Type Time(%) Time Calls Avg Min Max Name
>>>>>>> 
>>>>>>> GPU activities: 100.00% 1.7600us 1 1.7600us 1.7600us 1.7600us [CUDA memcpy HtoD]
>>>>>>> 
>>>>>>> API calls: 98.57% 235.61ms 1 235.61ms 235.61ms 235.61ms cudaFree
>>>>>>> 
>>>>>>> 0.66% 1.5802ms 97 16.290us 239ns 650.72us cuDeviceGetAttribute
>>>>>>> 
>>>>>>> 0.45% 1.0825ms 1 1.0825ms 1.0825ms 1.0825ms cuDeviceTotalMem
>>>>>>> 
>>>>>>> 0.23% 542.73us 1 542.73us 542.73us 542.73us cudaMalloc
>>>>>>> 
>>>>>>> 0.07% 174.77us 1 174.77us 174.77us 174.77us cuDeviceGetName
>>>>>>> 
>>>>>>> 0.01% 26.431us 1 26.431us 26.431us 26.431us cudaMemcpy
>>>>>>> 
>>>>>>> 0.00% 4.0330us 1 4.0330us 4.0330us 4.0330us cuDeviceGetPCIBusId
>>>>>>> 
>>>>>>> 0.00% 2.8560us 3 952ns 528ns 1.6150us cuDeviceGetCount
>>>>>>> 
>>>>>>> 0.00% 1.6190us 2 809ns 576ns 1.0430us cuDeviceGet
>>>>>>> 
>>>>>>> 0.00% 341ns 1 341ns 341ns 341ns cuDeviceGetUuid
>>>>>>> 
>>>>>>> 
>>>>>>> So this appears to be something specific happening within PETSc itself 
>>>>>>> - not
>>>>>>> necessarily an OLCF issue. I would suggest asking this question within 
>>>>>>> the
>>>>>>> PETSc community to understand what's happening. Please let me know if 
>>>>>>> you have
>>>>>>> any additional questions.
>>>>>>> 
>>>>>>> -Tom
>>>>>>> 
>>>>>>>> On Feb 10, 2020, at 11:14 AM, Smith, Barry F. <bsm...@mcs.anl.gov> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> gprof or some similar tool?
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Feb 10, 2020, at 11:18 AM, Zhang, Hong via petsc-dev 
>>>>>>>>> <petsc-dev@mcs.anl.gov> wrote:
>>>>>>>>> 
>>>>>>>>> -cuda_initialize 0 does not make any difference. Actually this issue 
>>>>>>>>> has nothing to do with PetscInitialize(). I tried to call cudaFree(0) 
>>>>>>>>> before PetscInitialize(), and it still took 7.5 seconds.
>>>>>>>>> 
>>>>>>>>> Hong
>>>>>>>>> 
>>>>>>>>>> On Feb 10, 2020, at 10:44 AM, Zhang, Junchao <jczh...@mcs.anl.gov> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> As I mentioned, have you tried -cuda_initialize 0? Also, 
>>>>>>>>>> PetscCUDAInitialize contains
>>>>>>>>>> ierr = PetscCUBLASInitializeHandle();CHKERRQ(ierr);
>>>>>>>>>> ierr = PetscCUSOLVERDnInitializeHandle();CHKERRQ(ierr);
>>>>>>>>>> Have you tried to comment out them and test again?
>>>>>>>>>> --Junchao Zhang
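
  One way to test that suggestion without rebuilding PETSc (a sketch; the program below
is not from the thread and must be linked with -lcudart -lcublas -lcusolver): time
cublasCreate() and cusolverDnCreate() directly in a standalone program, since creating
those handles is roughly what the two PETSc initialization routines do.

/* handle_timing.c -- sketch only */
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

static double now(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + 1e-9*ts.tv_nsec;
}

int main(void)
{
  cublasHandle_t     blas;
  cusolverDnHandle_t solver;
  double t0 = now();
  cudaFree(0);                 /* first CUDA call: context creation */
  double t1 = now();
  cublasCreate(&blas);         /* roughly what PetscCUBLASInitializeHandle() does */
  double t2 = now();
  cusolverDnCreate(&solver);   /* roughly what PetscCUSOLVERDnInitializeHandle() does */
  double t3 = now();
  printf("context =%lf s  cublasCreate =%lf s  cusolverDnCreate =%lf s\n",
         t1-t0, t2-t1, t3-t2);
  cublasDestroy(blas);
  cusolverDnDestroy(solver);
  return 0;
}
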
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Sat, Feb 8, 2020 at 5:22 PM Zhang, Hong via petsc-dev 
>>>>>>>>>> <petsc-dev@mcs.anl.gov> wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Feb 8, 2020, at 5:03 PM, Matthew Knepley <knep...@gmail.com> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Feb 8, 2020 at 4:34 PM Zhang, Hong via petsc-dev 
>>>>>>>>>>> <petsc-dev@mcs.anl.gov> wrote:
>>>>>>>>>>> I did some further investigation. The overhead persists with both 
>>>>>>>>>>> the PETSc shared library and the static library. The previous 
>>>>>>>>>>> example does not call any PETSc function, yet its first CUDA call 
>>>>>>>>>>> becomes very slow when it is linked against the PETSc shared 
>>>>>>>>>>> library. This indicates that the slowdown occurs when the symbol 
>>>>>>>>>>> (cudaFree) is resolved through the PETSc shared library, but not 
>>>>>>>>>>> when the symbol is found directly in the CUDA runtime library.
>>>>>>>>>>> 
>>>>>>>>>>> So the issue has nothing to do with the dynamic linker. The 
>>>>>>>>>>> following example can be used to easily reproduce the problem 
>>>>>>>>>>> (cudaFree(0) always takes ~7.5 seconds).
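
  One way to probe the symbol-resolution hypothesis above directly (a sketch, not
something that was run in this thread): resolve cudaFree at run time with dlopen/dlsym
so the lookup cannot go through libpetsc, then time the first call through the resulting
pointer while the program is still linked with -lpetsc. If that first call is fast,
symbol resolution through the PETSc library is implicated; if it is still slow, it is not.

/* dlsym_probe.c -- sketch only; compile with -ldl (and -lpetsc to keep the
   PETSc library in the picture). The soname matches the ldd output quoted
   later in this thread. */
#include <stdio.h>
#include <time.h>
#include <dlfcn.h>

static double now(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + 1e-9*ts.tv_nsec;
}

int main(void)
{
  void *h = dlopen("libcudart.so.10.1", RTLD_NOW);
  if (!h) { fprintf(stderr, "dlopen failed: %s\n", dlerror()); return 1; }
  int (*my_cudaFree)(void *) = (int (*)(void *)) dlsym(h, "cudaFree");
  if (!my_cudaFree) { fprintf(stderr, "dlsym failed: %s\n", dlerror()); return 1; }
  double t0 = now();
  my_cudaFree(0);              /* first CUDA call, resolved straight from libcudart */
  double t1 = now();
  printf("first cudaFree through dlsym pointer: %lf s\n", t1 - t0);
  dlclose(h);
  return 0;
}
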
>>>>>>>>>>> 
>>>>>>>>>>> 1) This should go to OLCF admin as Jeff suggests
>>>>>>>>>> 
>>>>>>>>>> I had sent this to OLCF admin before the discussion was started 
>>>>>>>>>> here. Thomas Papatheodore has followed up. I am trying to help him 
>>>>>>>>>> reproduce the problem on summit. 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 2) Just to make sure I understand, a static executable with this 
>>>>>>>>>>> code is still slow on the cudaFree(), since CUDA is a shared 
>>>>>>>>>>> library by default.
>>>>>>>>>> 
>>>>>>>>>> I prepared the code as a minimal example to reproduce the problem. 
>>>>>>>>>> It would be fair to say any code using PETSc (with CUDA enabled, 
>>>>>>>>>> built statically or dynamically) on summit suffers a 7.5-second 
>>>>>>>>>> overhead on the first CUDA function call (either in the user code or 
>>>>>>>>>> inside PETSc).
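
  If the overhead cannot be eliminated, a possible mitigation (a sketch, not a fix, and
not something discussed in the thread so far): pay the one-time CUDA start-up cost on a
helper thread launched at the top of main(), so it overlaps with the rest of the
application's setup.

/* warmup.c -- sketch only; compile with -lpthread -lcudart */
#include <pthread.h>
#include <stdio.h>
#include <cuda_runtime.h>

static void *cuda_warmup(void *arg)
{
  (void)arg;
  cudaFree(0);   /* triggers the expensive one-time CUDA initialization */
  return NULL;
}

int main(void)
{
  pthread_t t;
  pthread_create(&t, NULL, cuda_warmup, NULL);

  /* ... other (CPU-side) setup work proceeds here while CUDA initializes ... */

  pthread_join(t, NULL);   /* after this, the first "real" CUDA call should be cheap */
  printf("CUDA warmed up\n");
  return 0;
}
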
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Hong
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I think we should try:
>>>>>>>>>>> 
>>>>>>>>>>> a) Forcing a full static link, if possible
>>>>>>>>>>> 
>>>>>>>>>>> b) Asking OLCF about link resolution order
>>>>>>>>>>> 
>>>>>>>>>>> It sounds like a similar thing I have seen in the past where link 
>>>>>>>>>>> resolution order can exponentially increase load time.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Matt
>>>>>>>>>>> 
>>>>>>>>>>> bash-4.2$ cat ex_simple_petsc.c
>>>>>>>>>>> #include <time.h>
>>>>>>>>>>> #include <cuda_runtime.h>
>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>> #include <petscmat.h>
>>>>>>>>>>> 
>>>>>>>>>>> int main(int argc,char **args)
>>>>>>>>>>> {
>>>>>>>>>>>   clock_t start,s1,s2,s3;
>>>>>>>>>>>   double  cputime;
>>>>>>>>>>>   double  *init,tmp[100] = {0};
>>>>>>>>>>>   PetscErrorCode ierr=0;
>>>>>>>>>>> 
>>>>>>>>>>>   ierr = PetscInitialize(&argc,&args,(char*)0,NULL);if (ierr) return ierr;
>>>>>>>>>>>   start = clock();
>>>>>>>>>>>   cudaFree(0);
>>>>>>>>>>>   s1 = clock();
>>>>>>>>>>>   cudaMalloc((void **)&init,100*sizeof(double));
>>>>>>>>>>>   s2 = clock();
>>>>>>>>>>>   cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>>>>>>>>>>>   s3 = clock();
>>>>>>>>>>>   printf("free time =%lf malloc time =%lf copy time =%lf\n",
>>>>>>>>>>>          ((double) (s1 - start)) / CLOCKS_PER_SEC,
>>>>>>>>>>>          ((double) (s2 - s1)) / CLOCKS_PER_SEC,
>>>>>>>>>>>          ((double) (s3 - s2)) / CLOCKS_PER_SEC);
>>>>>>>>>>>   ierr = PetscFinalize();
>>>>>>>>>>>   return ierr;
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> Hong
>>>>>>>>>>> 
>>>>>>>>>>>> On Feb 7, 2020, at 3:09 PM, Zhang, Hong <hongzh...@anl.gov> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Note that the overhead was triggered by the first call to a CUDA 
>>>>>>>>>>>> function. So it seems the first CUDA call triggers loading the 
>>>>>>>>>>>> PETSc shared library (if it is linked), which is slow on the 
>>>>>>>>>>>> Summit file system.
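
  If the suspicion is library load time, one way to check it in isolation (a sketch; it
makes no CUDA calls, must be compiled with -ldl, and the path below is the one from the
ldd output later in the thread, so adjust it for the build being tested): dlopen the
PETSc shared library explicitly and time just the load. If the dlopen itself takes
seconds, the file system is the culprit; if it is fast, the cost comes from somewhere
else.

/* load_timing.c -- sketch only */
#include <stdio.h>
#include <time.h>
#include <dlfcn.h>

int main(void)
{
  const char *lib = "/autofs/nccs-svm1_home1/hongzh/Projects/petsc/"
                    "arch-olcf-summit-sell-opt/lib/libpetsc.so.3.012";
  struct timespec a, b;
  clock_gettime(CLOCK_MONOTONIC, &a);
  void *h = dlopen(lib, RTLD_NOW);   /* RTLD_NOW resolves all symbols up front */
  clock_gettime(CLOCK_MONOTONIC, &b);
  if (!h) { fprintf(stderr, "dlopen failed: %s\n", dlerror()); return 1; }
  printf("dlopen(libpetsc) took %lf s\n",
         (b.tv_sec - a.tv_sec) + 1e-9*(double)(b.tv_nsec - a.tv_nsec));
  dlclose(h);
  return 0;
}
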
>>>>>>>>>>>> 
>>>>>>>>>>>> Hong
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Feb 7, 2020, at 2:54 PM, Zhang, Hong via petsc-dev 
>>>>>>>>>>>>> <petsc-dev@mcs.anl.gov> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Linking any other shared library does not slow down the 
>>>>>>>>>>>>> execution. The PETSc shared library is the only one causing 
>>>>>>>>>>>>> trouble.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Here is the ldd output for the two versions. For the first 
>>>>>>>>>>>>> (fast) version, I removed -lpetsc and it ran very quickly. The 
>>>>>>>>>>>>> second (slow) version was linked against the PETSc shared library.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> bash-4.2$ ldd ex_simple
>>>>>>>>>>>>> linux-vdso64.so.1 =>  (0x0000200000050000)
>>>>>>>>>>>>> liblapack.so.0 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0
>>>>>>>>>>>>>  (0x0000200000070000)
>>>>>>>>>>>>> libblas.so.0 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0
>>>>>>>>>>>>>  (0x00002000009b0000)
>>>>>>>>>>>>> libhdf5hl_fortran.so.100 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100
>>>>>>>>>>>>>  (0x0000200000e80000)
>>>>>>>>>>>>> libhdf5_fortran.so.100 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100
>>>>>>>>>>>>>  (0x0000200000ed0000)
>>>>>>>>>>>>> libhdf5_hl.so.100 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100
>>>>>>>>>>>>>  (0x0000200000f50000)
>>>>>>>>>>>>> libhdf5.so.103 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103
>>>>>>>>>>>>>  (0x0000200000fb0000)
>>>>>>>>>>>>> libX11.so.6 => /usr/lib64/libX11.so.6 (0x00002000015e0000)
>>>>>>>>>>>>> libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 
>>>>>>>>>>>>> (0x0000200001770000)
>>>>>>>>>>>>> libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 
>>>>>>>>>>>>> (0x0000200009b00000)
>>>>>>>>>>>>> libcudart.so.10.1 => 
>>>>>>>>>>>>> /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 
>>>>>>>>>>>>> (0x000020000d950000)
>>>>>>>>>>>>> libcusparse.so.10 => 
>>>>>>>>>>>>> /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 
>>>>>>>>>>>>> (0x000020000d9f0000)
>>>>>>>>>>>>> libcusolver.so.10 => 
>>>>>>>>>>>>> /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 
>>>>>>>>>>>>> (0x0000200012f50000)
>>>>>>>>>>>>> libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000020001dc40000)
>>>>>>>>>>>>> libdl.so.2 => /usr/lib64/libdl.so.2 (0x000020001ddd0000)
>>>>>>>>>>>>> libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x000020001de00000)
>>>>>>>>>>>>> libmpiprofilesupport.so.3 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3
>>>>>>>>>>>>>  (0x000020001de40000)
>>>>>>>>>>>>> libmpi_ibm_usempi.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so
>>>>>>>>>>>>>  (0x000020001de70000)
>>>>>>>>>>>>> libmpi_ibm_mpifh.so.3 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3
>>>>>>>>>>>>>  (0x000020001dea0000)
>>>>>>>>>>>>> libmpi_ibm.so.3 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3
>>>>>>>>>>>>>  (0x000020001df40000)
>>>>>>>>>>>>> libpgf90rtl.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so
>>>>>>>>>>>>>  (0x000020001e0b0000)
>>>>>>>>>>>>> libpgf90.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so
>>>>>>>>>>>>>  (0x000020001e0f0000)
>>>>>>>>>>>>> libpgf90_rpm1.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so
>>>>>>>>>>>>>  (0x000020001e6a0000)
>>>>>>>>>>>>> libpgf902.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so
>>>>>>>>>>>>>  (0x000020001e6d0000)
>>>>>>>>>>>>> libpgftnrtl.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so
>>>>>>>>>>>>>  (0x000020001e700000)
>>>>>>>>>>>>> libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x000020001e730000)
>>>>>>>>>>>>> libpgkomp.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so
>>>>>>>>>>>>>  (0x000020001e760000)
>>>>>>>>>>>>> libomp.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so
>>>>>>>>>>>>>  (0x000020001e790000)
>>>>>>>>>>>>> libomptarget.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so
>>>>>>>>>>>>>  (0x000020001e880000)
>>>>>>>>>>>>> libpgmath.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so
>>>>>>>>>>>>>  (0x000020001e8b0000)
>>>>>>>>>>>>> libpgc.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so
>>>>>>>>>>>>>  (0x000020001e9d0000)
>>>>>>>>>>>>> librt.so.1 => /usr/lib64/librt.so.1 (0x000020001eb40000)
>>>>>>>>>>>>> libm.so.6 => /usr/lib64/libm.so.6 (0x000020001eb70000)
>>>>>>>>>>>>> libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x000020001ec60000)
>>>>>>>>>>>>> libc.so.6 => /usr/lib64/libc.so.6 (0x000020001eca0000)
>>>>>>>>>>>>> libz.so.1 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1
>>>>>>>>>>>>>  (0x000020001ee90000)
>>>>>>>>>>>>> libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x000020001eef0000)
>>>>>>>>>>>>> /lib64/ld64.so.2 (0x0000200000000000)
>>>>>>>>>>>>> libcublasLt.so.10 => 
>>>>>>>>>>>>> /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 
>>>>>>>>>>>>> (0x000020001ef40000)
>>>>>>>>>>>>> libutil.so.1 => /usr/lib64/libutil.so.1 (0x0000200020e50000)
>>>>>>>>>>>>> libhwloc_ompi.so.15 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15
>>>>>>>>>>>>>  (0x0000200020e80000)
>>>>>>>>>>>>> libevent-2.1.so.6 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6
>>>>>>>>>>>>>  (0x0000200020ef0000)
>>>>>>>>>>>>> libevent_pthreads-2.1.so.6 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6
>>>>>>>>>>>>>  (0x0000200020f70000)
>>>>>>>>>>>>> libopen-rte.so.3 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3
>>>>>>>>>>>>>  (0x0000200020fa0000)
>>>>>>>>>>>>> libopen-pal.so.3 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3
>>>>>>>>>>>>>  (0x00002000210b0000)
>>>>>>>>>>>>> libXau.so.6 => /usr/lib64/libXau.so.6 (0x00002000211a0000)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> bash-4.2$ ldd ex_simple_slow
>>>>>>>>>>>>> linux-vdso64.so.1 =>  (0x0000200000050000)
>>>>>>>>>>>>> libpetsc.so.3.012 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_home1/hongzh/Projects/petsc/arch-olcf-summit-sell-opt/lib/libpetsc.so.3.012
>>>>>>>>>>>>>  (0x0000200000070000)
>>>>>>>>>>>>> liblapack.so.0 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0
>>>>>>>>>>>>>  (0x0000200002be0000)
>>>>>>>>>>>>> libblas.so.0 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0
>>>>>>>>>>>>>  (0x0000200003520000)
>>>>>>>>>>>>> libhdf5hl_fortran.so.100 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100
>>>>>>>>>>>>>  (0x00002000039f0000)
>>>>>>>>>>>>> libhdf5_fortran.so.100 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100
>>>>>>>>>>>>>  (0x0000200003a40000)
>>>>>>>>>>>>> libhdf5_hl.so.100 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100
>>>>>>>>>>>>>  (0x0000200003ac0000)
>>>>>>>>>>>>> libhdf5.so.103 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103
>>>>>>>>>>>>>  (0x0000200003b20000)
>>>>>>>>>>>>> libX11.so.6 => /usr/lib64/libX11.so.6 (0x0000200004150000)
>>>>>>>>>>>>> libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 
>>>>>>>>>>>>> (0x00002000042e0000)
>>>>>>>>>>>>> libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 
>>>>>>>>>>>>> (0x000020000c670000)
>>>>>>>>>>>>> libcudart.so.10.1 => 
>>>>>>>>>>>>> /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 
>>>>>>>>>>>>> (0x00002000104c0000)
>>>>>>>>>>>>> libcusparse.so.10 => 
>>>>>>>>>>>>> /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 
>>>>>>>>>>>>> (0x0000200010560000)
>>>>>>>>>>>>> libcusolver.so.10 => 
>>>>>>>>>>>>> /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 
>>>>>>>>>>>>> (0x0000200015ac0000)
>>>>>>>>>>>>> libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002000207b0000)
>>>>>>>>>>>>> libdl.so.2 => /usr/lib64/libdl.so.2 (0x0000200020940000)
>>>>>>>>>>>>> libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x0000200020970000)
>>>>>>>>>>>>> libmpiprofilesupport.so.3 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3
>>>>>>>>>>>>>  (0x00002000209b0000)
>>>>>>>>>>>>> libmpi_ibm_usempi.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so
>>>>>>>>>>>>>  (0x00002000209e0000)
>>>>>>>>>>>>> libmpi_ibm_mpifh.so.3 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3
>>>>>>>>>>>>>  (0x0000200020a10000)
>>>>>>>>>>>>> libmpi_ibm.so.3 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3
>>>>>>>>>>>>>  (0x0000200020ab0000)
>>>>>>>>>>>>> libpgf90rtl.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so
>>>>>>>>>>>>>  (0x0000200020c20000)
>>>>>>>>>>>>> libpgf90.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so
>>>>>>>>>>>>>  (0x0000200020c60000)
>>>>>>>>>>>>> libpgf90_rpm1.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so
>>>>>>>>>>>>>  (0x0000200021210000)
>>>>>>>>>>>>> libpgf902.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so
>>>>>>>>>>>>>  (0x0000200021240000)
>>>>>>>>>>>>> libpgftnrtl.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so
>>>>>>>>>>>>>  (0x0000200021270000)
>>>>>>>>>>>>> libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x00002000212a0000)
>>>>>>>>>>>>> libpgkomp.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so
>>>>>>>>>>>>>  (0x00002000212d0000)
>>>>>>>>>>>>> libomp.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so
>>>>>>>>>>>>>  (0x0000200021300000)
>>>>>>>>>>>>> libomptarget.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so
>>>>>>>>>>>>>  (0x00002000213f0000)
>>>>>>>>>>>>> libpgmath.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so
>>>>>>>>>>>>>  (0x0000200021420000)
>>>>>>>>>>>>> libpgc.so => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so
>>>>>>>>>>>>>  (0x0000200021540000)
>>>>>>>>>>>>> librt.so.1 => /usr/lib64/librt.so.1 (0x00002000216b0000)
>>>>>>>>>>>>> libm.so.6 => /usr/lib64/libm.so.6 (0x00002000216e0000)
>>>>>>>>>>>>> libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00002000217d0000)
>>>>>>>>>>>>> libc.so.6 => /usr/lib64/libc.so.6 (0x0000200021810000)
>>>>>>>>>>>>> libz.so.1 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1
>>>>>>>>>>>>>  (0x0000200021a10000)
>>>>>>>>>>>>> libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x0000200021a60000)
>>>>>>>>>>>>> /lib64/ld64.so.2 (0x0000200000000000)
>>>>>>>>>>>>> libcublasLt.so.10 => 
>>>>>>>>>>>>> /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 
>>>>>>>>>>>>> (0x0000200021ab0000)
>>>>>>>>>>>>> libutil.so.1 => /usr/lib64/libutil.so.1 (0x00002000239c0000)
>>>>>>>>>>>>> libhwloc_ompi.so.15 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15
>>>>>>>>>>>>>  (0x00002000239f0000)
>>>>>>>>>>>>> libevent-2.1.so.6 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6
>>>>>>>>>>>>>  (0x0000200023a60000)
>>>>>>>>>>>>> libevent_pthreads-2.1.so.6 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6
>>>>>>>>>>>>>  (0x0000200023ae0000)
>>>>>>>>>>>>> libopen-rte.so.3 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3
>>>>>>>>>>>>>  (0x0000200023b10000)
>>>>>>>>>>>>> libopen-pal.so.3 => 
>>>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3
>>>>>>>>>>>>>  (0x0000200023c20000)
>>>>>>>>>>>>> libXau.so.6 => /usr/lib64/libXau.so.6 (0x0000200023d10000)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Feb 7, 2020, at 2:31 PM, Smith, Barry F. <bsm...@mcs.anl.gov> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ldd -o on the executable of both linkings of your code.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> My guess is that without PETSc it is linking the static version 
>>>>>>>>>>>>>> of the needed libraries and with PETSc the shared. And, in 
>>>>>>>>>>>>>> typical fashion, the shared libraries are off on some super slow 
>>>>>>>>>>>>>> file system so take a long time to be loaded and linked in on 
>>>>>>>>>>>>>> demand.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Still a performance bug in Summit. 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Barry
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Feb 7, 2020, at 12:23 PM, Zhang, Hong via petsc-dev 
>>>>>>>>>>>>>>> <petsc-dev@mcs.anl.gov> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Previously I noticed that the first call to a CUDA function 
>>>>>>>>>>>>>>> such as cudaMalloc or cudaFree in PETSc takes a long time 
>>>>>>>>>>>>>>> (7.5 seconds) on Summit. I then prepared the simple example 
>>>>>>>>>>>>>>> attached below to help OLCF reproduce the problem. It turned 
>>>>>>>>>>>>>>> out that the problem is caused by PETSc: the 7.5-second 
>>>>>>>>>>>>>>> overhead is observed only when the PETSc library is linked. If 
>>>>>>>>>>>>>>> I do not link PETSc, it runs normally. Does anyone have any 
>>>>>>>>>>>>>>> idea why this happens and how to fix it?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hong (Mr.)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> bash-4.2$ cat ex_simple.c
>>>>>>>>>>>>>>> #include <time.h>
>>>>>>>>>>>>>>> #include <cuda_runtime.h>
>>>>>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> int main(int argc,char **args)
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>   clock_t start,s1,s2,s3;
>>>>>>>>>>>>>>>   double  cputime;
>>>>>>>>>>>>>>>   double  *init,tmp[100] = {0};
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>   start = clock();
>>>>>>>>>>>>>>>   cudaFree(0);
>>>>>>>>>>>>>>>   s1 = clock();
>>>>>>>>>>>>>>>   cudaMalloc((void **)&init,100*sizeof(double));
>>>>>>>>>>>>>>>   s2 = clock();
>>>>>>>>>>>>>>>   cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>>>>>>>>>>>>>>>   s3 = clock();
>>>>>>>>>>>>>>>   printf("free time =%lf malloc time =%lf copy time =%lf\n",
>>>>>>>>>>>>>>>          ((double) (s1 - start)) / CLOCKS_PER_SEC,
>>>>>>>>>>>>>>>          ((double) (s2 - s1)) / CLOCKS_PER_SEC,
>>>>>>>>>>>>>>>          ((double) (s3 - s2)) / CLOCKS_PER_SEC);
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>   return 0;
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> -- 
>>>>>>>>>>> What most experimenters take for granted before they begin their 
>>>>>>>>>>> experiments is infinitely more interesting than any results to 
>>>>>>>>>>> which their experiments lead.
>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>> 
>>>>>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> What most experimenters take for granted before they begin their 
>>>>>>> experiments is infinitely more interesting than any results to which 
>>>>>>> their experiments lead.
>>>>>>> -- Norbert Wiener
>>>>>>> 
>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
