rich7420 opened a new issue, #699:
URL: https://github.com/apache/mahout/issues/699

   ### Summary
   
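   `cudaMalloc` and `cudaFree` are costing us a lot: together they account for about 30.7% of CUDA API time in the profile below (17.6% for `cudaFree`, 13.1% for `cudaMalloc`). I think we should replace the current allocation approach with a staging buffer pool.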
   
   
   ```
    Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)  Name
    --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------
        54.5       2963528632       8000   370441.1   378702.0      4008   3003620     311060.3  cuStreamSynchronize
        17.6        955866327       1973   484473.6   123488.0     93255  16764314    2141400.0  cudaFree
        13.1        715242357       2006   356551.5   337836.5      3350   2746979     117384.7  cudaMalloc
         5.5        298952961      12186    24532.5    16054.5      5604  17572514     181458.8  cudaLaunchKernel
         5.2        280867513       4000    70216.9    61677.5     38660    662149      31375.0  cuMemcpyHtoDAsync_v2
         1.9        104137233       4000    26034.3    19925.0     13352    163122      13620.3  cuLaunchKernel
         0.8         46113060      28002     1646.8      606.0       165    240862       3654.9  cuCtxSetCurrent
         0.8         41350287       8000     5168.8     3516.0       423   6532567      73131.5  cuMemAllocAsync
         0.5         25542603       8000     3192.8     2747.0      1383     84901       2467.2  cuMemFreeAsync
         0.1          5639877       2006     2811.5     1113.0       255   3092087      69029.1  cudaStreamIsCapturing_v10000
         0.0          2564314          2  1282157.0  1282157.0     21331   2542983    1783077.2  cudaMemcpyAsync
         0.0           736822          1   736822.0   736822.0    736822    736822          0.0  cuModuleLoadData
         0.0           427895          2   213947.5   213947.5      3496    424399     297623.4  cudaDeviceSynchronize
         0.0           152769       1149      133.0      100.0        58     11135        338.9  cuGetProcAddress_v2
         0.0            77894          2    38947.0    38947.0     20841     57053      25605.8  cudaStreamSynchronize
         0.0            15673         18      870.7      220.5       208      8021       1861.3  cudaEventCreateWithFlags
         0.0             4043          3     1347.7     1287.0       989      1767        392.5  cuInit
         0.0             1539          1     1539.0     1539.0      1539      1539          0.0  cuEventCreate
         0.0             1416          1     1416.0     1416.0      1416      1416          0.0  cuEventDestroy_v2
         0.0              390          3      130.0      130.0       128       132          2.0  cuModuleGetLoadingMode
   
   ```
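
To make the proposal concrete, here is a minimal sketch of what a staging buffer pool could look like. It is not taken from the Mahout code base: the class name `StagingBufferPool`, its methods, and the power-of-two size classes are all hypothetical. The backing allocator is passed in as a pair of callbacks so the sketch compiles and runs without a GPU; in a real build those callbacks would presumably wrap `cudaMalloc`/`cudaFree` (or a pinned-host equivalent).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical sketch of a staging buffer pool (names are illustrative only).
// Buffers are bucketed by power-of-two capacity and reused instead of being
// returned to the backing allocator after every use.
class StagingBufferPool {
public:
    using AllocFn = std::function<void*(std::size_t)>;
    using FreeFn = std::function<void(void*)>;

    // Assumption: with CUDA, alloc_fn/free_fn would wrap cudaMalloc/cudaFree.
    // The defaults use host malloc/free so the sketch runs anywhere.
    explicit StagingBufferPool(
        AllocFn alloc_fn = [](std::size_t n) { return std::malloc(n); },
        FreeFn free_fn = [](void* p) { std::free(p); })
        : alloc_(std::move(alloc_fn)), free_(std::move(free_fn)) {}

    StagingBufferPool(const StagingBufferPool&) = delete;
    StagingBufferPool& operator=(const StagingBufferPool&) = delete;

    // Frees only the idle buffers; buffers still held by callers are not tracked.
    ~StagingBufferPool() {
        for (auto& entry : free_lists_)
            for (void* p : entry.second) free_(p);
    }

    // Smallest power-of-two capacity (at least 256 bytes) that fits `bytes`.
    static std::size_t round_up(std::size_t bytes) {
        std::size_t cap = 256;
        while (cap < bytes) cap <<= 1;
        return cap;
    }

    // Returns a buffer of at least `bytes`, reusing an idle one when possible.
    void* acquire(std::size_t bytes) {
        const std::size_t cap = round_up(bytes);
        std::vector<void*>& idle = free_lists_[cap];
        if (!idle.empty()) {
            void* p = idle.back();
            idle.pop_back();
            ++reuse_count_;
            return p;
        }
        void* p = alloc_(cap);
        assert(p != nullptr && "backing allocator failed");
        ++alloc_count_;
        return p;
    }

    // Hands a buffer back to the pool. `bytes` must be the size passed to
    // acquire(). With CUDA, call this only after the work using it has finished.
    void release(void* p, std::size_t bytes) {
        free_lists_[round_up(bytes)].push_back(p);
    }

    std::size_t alloc_count() const { return alloc_count_; }
    std::size_t reuse_count() const { return reuse_count_; }

private:
    AllocFn alloc_;
    FreeFn free_;
    std::map<std::size_t, std::vector<void*>> free_lists_;
    std::size_t alloc_count_ = 0;
    std::size_t reuse_count_ = 0;
};
```

In a steady-state loop that calls `acquire` and `release` with similar sizes, the backing allocator is hit only on the first iteration, which is the effect the issue is after. Rounding to size classes trades some memory for a higher reuse rate; whether that trade is right here depends on how much the buffer sizes vary between batches.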


