Hi Andreas,

thanks for the quick answer! It seems that my question is related to 
scikits.cuda. The code attached to this email illustrates the issue: by 
uncommenting the line "#model.fit(a_gpu, b_gpu)", the profiling output 
"compiler.py:185(compile)" vanishes. The manual PyCUDA kernel seems to be 
compiled beforehand (during the initialization of the model). The call to 
"culinalg.dot", however, still seems to trigger the compiler output after the 
model has been initialized ...


For this toy example, there is no big difference in the timing of the profiled 
call of "fit" (i.e., whether or not the first "fit" line is uncommented; on my 
machine, it takes about 0.41 seconds in both cases). However, for the project I 
am working on (which is too big to share), it does make a difference (1.35 
seconds instead of the 1.662 seconds mentioned in my previous email). 
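
To make the intent concrete, here is a minimal sketch of the warm-up I have in 
mind, assuming the remaining compilations are triggered lazily by the first 
call (the _warm_up name and the dummy sizes are placeholders, not part of the 
attached script). It could be invoked at the end of _init_device:

    def _warm_up(self):
        # Hypothetical warm-up, not part of the attached script: run fit()
        # once on tiny dummy inputs so that whatever culinalg.dot and the
        # gpuarray subtraction compile on their first use is built during
        # model initialization rather than during the profiled call.
        dummy = gpuarray.to_gpu(numpy.ones((2, 2), dtype=numpy.float32))
        self.fit(dummy, dummy)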

Cheers
Fabian







Andreas Kloeckner <li...@informa.tiker.net> wrote on Thursday, 19 February 2015 
at 17:08:
<fas...@yahoo.de> writes:


> Hi,
>
> is it possible to "precompile" the invoked kernels beforehand? My code makes 
> use of several CUDA kernels, which are basically called within a "fit" 
> function. Profiling the code with cProfile yields:
>
> 42272 function calls (42228 primitive calls) in 1.662 seconds
> ...
>
> ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
>     11    0.000    0.000    0.344    0.031  compiler.py:185(compile)
>     11    0.002    0.000    0.346    0.031  compiler.py:245(__init__)
>      4    0.000    0.000    0.317    0.079  compiler.py:33(preprocess_source)
>     11    0.000    0.000    0.342    0.031  compiler.py:66(compile_plain)
> ...
>
> Thus, about 0.344 of the 1.662 seconds are spent on compiling the
> code. When executing the function "fit" twice, the code is not
> compiled again (hence, saving these 0.344 seconds for the second call
> of "fit"). I would like to somehow precompile all involved kernels as
> soon as the object the "fit" function belongs to is initialized...
>
>
> Can one invoke the overall compilation process beforehand?

Sure! That's what the SourceModule constructor does. Just keep the
instance around.

Andreas
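
For reference, a minimal standalone sketch of the pattern Andreas describes: 
the compilation happens inside the SourceModule constructor, so building the 
module once and keeping the function handle around means later calls only 
launch the already-compiled kernel. (The "scale" kernel is only a toy 
illustration, not taken from the attached script below.)

import numpy
import pycuda.autoinit  # creates a default context for this standalone sketch
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

# the compilation cost is paid here, at construction time
mod = SourceModule("""
__global__ void scale(float *x, float s, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) x[i] *= s;
}
""")
scale = mod.get_function("scale")  # keep this handle around

x_gpu = gpuarray.to_gpu(numpy.arange(256, dtype=numpy.float32))
# subsequent invocations only launch the kernel, no recompilation
scale(x_gpu, numpy.float32(2.0), numpy.int32(256), block=(256, 1, 1), grid=(1, 1))

The full script attached to this email follows:
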
import math
import time
import cProfile
import numpy
import pycuda.driver
import pycuda.gpuarray as gpuarray
import scikits.cuda.linalg as culinalg
from pycuda.compiler import SourceModule

class MyModel(object):

    def __init__(self, device_id=0):

        self.device_id = device_id
        self._init_device()

    def __del__(self):

        self.ctx.pop()

    def _init_device(self):
        """ Initializes the device.
        """
        
        # sanity check for device id
        if (self.device_id < 0) or (self.device_id > pycuda.driver.Device.count() - 1):
            raise Exception("Invalid device id. Maximum device id is " + str(pycuda.driver.Device.count() - 1))
        
        # init driver and context
        pycuda.driver.init()
        self.device = pycuda.driver.Device(self.device_id)
        self.ctx = self.device.make_context()
        
        # init linalg for cuda
        culinalg.init()
        
        # init kernels
        self._init_kernels()

    def _init_kernels(self):
        """ Compiles the hand-written CUDA kernel once and keeps the function
        handle around for later launches.
        """

        kernel_elementwise_multiply_mod = SourceModule("""
        __global__ void elementwise_multiply(float *dest, float *a, int n, int m)
        {
        
          
          int tidx = threadIdx.x + blockDim.x * blockIdx.x;
          int tidy = threadIdx.y + blockDim.y * blockIdx.y;
          
          if (tidx >= n || tidy >= m){
              return;
          }
          
          dest[tidx*m + tidy] = a[tidx*m + tidy]*a[tidx*m + tidy];
          
        }        
        """)
        self.kernel_elementwise_multiply = kernel_elementwise_multiply_mod.get_function("elementwise_multiply")  

    def _elementwise_mult(self, a):
        """ Squares each element of a on the GPU via the hand-written kernel.
        """

        n = a.shape[0]
        m = a.shape[1]
        dest = gpuarray.empty((n, m), numpy.float32)
         
        blocksize = 16
        block = (blocksize, blocksize, 1)
        grid = (int(math.ceil(float(n) / blocksize)), int(math.ceil(float(m) / blocksize)))

        self.kernel_elementwise_multiply(dest, a, numpy.int32(n), numpy.int32(m), block=block, grid=grid)        

        return dest

    def _mult(self, a, b, transa=False, transb=False):
        """ Matrix product of a and b via culinalg.dot, with optional
        transposes.
        """

        transa = "T" if transa else "N"
        transb = "T" if transb else "N"

        return culinalg.dot(a, b, transa=transa, transb=transb)

    def fit(self, a_gpu, b_gpu):
        """ Dummy workload: returns the elementwise square of a_gpu minus the
        matrix product of a_gpu and b_gpu.
        """

        test = self._mult(a_gpu, b_gpu)
        test2 = self._elementwise_mult(a_gpu)
        
        return test2 - test

# initialize model
model = MyModel(device_id=0)

# test arrays
a = numpy.random.randn(8000*8000).astype(numpy.float32).reshape((8000,8000))
b = numpy.random.randn(8000*8000).astype(numpy.float32).reshape((8000,8000))
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)

# NOTE: By uncommenting this line, the profiling output "compiler.py:185(compile)" vanishes 
#model.fit(a_gpu, b_gpu)

# this call is profiled
def run():
    res = model.fit(a_gpu, b_gpu)
    print(res.get())

start = time.time()
cProfile.run("run()")
end = time.time()
print("Elapsed time: " + str(end-start))


_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
