
I am new to GPU computing. Before trying to move a calculation onto the GPU, I would like to ask whether a significant speedup is likely in this scenario, and how best to approach it in PyCUDA.

I need to calculate the probability density of a few multidimensional Gaussians (on the order of 10) for many vectors (hundreds of thousands to millions). The vectors usually have on the order of 100-600 elements.

Currently I do this in Python with NumPy (backed by MKL). I pre-compute the covariance matrices and their determinants and then calculate:

    def calc_observation_likelihoods(self, x, y, t):
        data = self.observations.retrieve_traces(x, y)  # Here I get the data
        self.calc_inverses(t)  # Here I retrieve the cached covariance matrices
        log_likelihoods = self.log_likelihoods  # Just a previously defined array

        # Here comes the part I would like to speed up
        for f_ind in range(Mk):
            log_det, U = self.inverse_structure[f_ind]  # This is currently a dictionary holding the required information
            mu = self.likelihood_means[:, f_ind]  # Also cached
            dev = (data - mu) * self.scaling  # This scaling is required to prevent over/underflow
            rank = dev.shape[0]
            maha = np.sum(np.square(np.dot(dev, U)), axis=-1)  # This is the line I spend most time on.
            res = -0.5 * (rank * _LOG_2PI + log_det + maha)
            log_likelihoods[f_ind] = res
            #print dev.max(), dev.min(), res
        log_likelihood_maxs = log_likelihoods.max()
        log_likelihoods -= log_likelihood_maxs
        likelihoods = np.exp(log_likelihoods)
        return likelihoods
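
To make the hot spot concrete, here is a minimal NumPy sketch of the same computation with the per-Gaussian matrix products merged into one large product, using `(data - mu) U = data U - mu U`. It is only an illustration under assumptions that the code above does not confirm: `data` has shape `(N, D)` with one vector per row, `scaling` is a scalar or a length-`D` array, the dimension in the constant term is taken from the last axis, and the function and argument names are hypothetical.

```python
import numpy as np

_LOG_2PI = np.log(2.0 * np.pi)


def log_likelihoods_loop(data, means, Us, log_dets, scaling):
    """Reference version: one matrix product per Gaussian, as in the loop above."""
    n_dim = data.shape[1]  # assumption: vectors are stored as rows of data
    out = np.empty((len(Us), data.shape[0]))
    for k, U in enumerate(Us):
        dev = (data - means[:, k]) * scaling
        maha = np.sum(np.square(np.dot(dev, U)), axis=-1)
        out[k] = -0.5 * (n_dim * _LOG_2PI + log_dets[k] + maha)
    return out


def log_likelihoods_batched(data, means, Us, log_dets, scaling):
    """Same result, but with a single large matrix product over all Gaussians."""
    n_vec, n_dim = data.shape
    n_comp = len(Us)
    n_cols = Us[0].shape[1]
    # Stack the U matrices side by side: shape (D, K * n_cols).
    U_all = np.concatenate(Us, axis=1)
    # One GEMM projects every scaled vector through every U.
    proj = np.dot(data * scaling, U_all).reshape(n_vec, n_comp, n_cols)
    # Each scaled mean projected through its own U: shape (K, n_cols).
    offsets = np.stack(
        [np.dot(means[:, k] * scaling, Us[k]) for k in range(n_comp)]
    )
    maha = np.sum(np.square(proj - offsets), axis=-1)  # shape (N, K)
    res = -0.5 * (n_dim * _LOG_2PI + np.asarray(log_dets) + maha)
    return res.T  # shape (K, N), matching the loop version
```

The intermediate array has `N * K * D` entries, so for millions of vectors this would have to be applied to chunks of rows. The same chunking would presumably be needed on a GPU as well, since the full data set may not fit into device memory.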

This looks to me like a problem where the GPU should give a big speedup; however, I suppose I would have to move a lot of data around. I would therefore be grateful for some pointers on whether this is a promising candidate for the GPU.
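
To put rough numbers on the data-movement concern, here is a back-of-envelope sketch comparing the arithmetic in the matrix products with the time to copy the input vectors to the device. The peak throughput is derived from the core count and clock reported for my card (448 cores at 1.147 GHz, counting one fused multiply-add as two operations); the PCIe bandwidth of 6 GB/s is an assumed figure, not a measurement, and the function name is hypothetical.

```python
def estimate_gpu_tradeoff(n_vectors, n_dim, n_gauss, bytes_per_value=4,
                          peak_gflops=448 * 1.147 * 2, pcie_gb_per_s=6.0):
    """Rough compute time vs. host-to-device copy time for the matrix products.

    peak_gflops: theoretical single-precision peak (cores * GHz * 2).
    pcie_gb_per_s: assumed effective transfer rate, not measured.
    """
    # Each of the n_gauss products is (N x D) times (D x D): 2 * N * D * D flops.
    flops = 2.0 * n_vectors * n_dim * n_dim * n_gauss
    # Only the input vectors are counted; the U matrices are small by comparison.
    bytes_in = float(n_vectors) * n_dim * bytes_per_value
    return {
        "flops": flops,
        "bytes_in": bytes_in,
        "flops_per_byte": flops / bytes_in,
        "compute_s": flops / (peak_gflops * 1e9),
        "transfer_s": bytes_in / (pcie_gb_per_s * 1e9),
    }


if __name__ == "__main__":
    # Upper end of the sizes mentioned above: 1e6 vectors, 600 dimensions, 10 Gaussians.
    for key, value in estimate_gpu_tradeoff(10**6, 600, 10).items():
        print("%-15s %.3g" % (key, value))
```

Under these assumptions the copy would be small compared with the arithmetic, because every transferred value is reused in thousands of operations. Real matrix products reach only a fraction of the theoretical peak, and double precision would be slower still, so this is only an indication of whether the transfer is likely to dominate.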

Running `deviceQuery` from the CUDA samples on my GPU gives:

    ./deviceQuery Starting...

     CUDA Device Query (Runtime API) version (CUDART static linking)

    Detected 1 CUDA Capable device(s)

    Device 0: "Quadro 6000"
      CUDA Driver Version / Runtime Version          7.0 / 7.0
      CUDA Capability Major/Minor version number:    2.0
      Total amount of global memory:                 6143 MBytes (6441598976 bytes)
      (14) Multiprocessors, ( 32) CUDA Cores/MP:     448 CUDA Cores
      GPU Max Clock rate:                            1147 MHz (1.15 GHz)
      Memory Clock rate:                             1494 Mhz
      Memory Bus Width:                              384-bit
      L2 Cache Size:                                 786432 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
      Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 32768
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  1536
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
      Run time limit on kernels:                     Yes
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Disabled
      Device supports Unified Addressing (UVA):      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = Quadro 6000
    Result = PASS

