Received from Keith Brown on Sun, Nov 08, 2015 at 10:23:18PM EST:
> I have several thousand matrices for which I need to calculate the dot
> product, so it seems PyCUDA should do the trick (I hope). I am
> running into an issue with block sizes.
> 
> Here is my code
> 
> #!/usr/bin/env python
> import sys,time
> from string import Template
> import numpy as np
> from pycuda import driver, compiler, gpuarray, tools
> from pycuda.compiler import SourceModule
> import pycuda.autoinit
> 
> 
> def main():
>     d={}
>     size=4
>     d['size']=size
> 
>     src=Template("""
> __global__ void MatrixMulKernel(float *a, float *b, float *c)
> {
>     int tx = threadIdx.x;
>     int ty = threadIdx.y;
> 
>     // Guard: with a 16x16 block and $size == 4, threads with tx or
>     // ty >= $size would otherwise read and write out of bounds.
>     if (tx >= $size || ty >= $size)
>         return;
> 
>     float Pvalue = 0;
> 
>     for (int k = 0; k < $size; ++k) {
>         float Aelement = a[ty * $size + k];
>         float Belement = b[k * $size + tx];
>         Pvalue += Aelement * Belement;
>     }
>     c[ty * $size + tx] = Pvalue;
> }
>     """)
> 
>     a_cpu = np.random.randn(size,size).astype(np.float32)
>     b_cpu = np.random.randn(size,size).astype(np.float32)
> 
>     a_gpu=gpuarray.to_gpu(a_cpu)
>     b_gpu=gpuarray.to_gpu(b_cpu)
>     c_gpu = gpuarray.empty((size,size), np.float32)
> 
>     mod = compiler.SourceModule(src.substitute(d))
>     mm = mod.get_function("MatrixMulKernel")
>     # The launch is asynchronous; c_gpu.get() below blocks until the
>     # kernel has finished.
>     mm(a_gpu, b_gpu, c_gpu,
>        block=(16,16,1),
>        )
>     start=time.time()
> 
>     gpu_ans=c_gpu.get()
>     stop=time.time()
>     print "Gpu",stop-start
> 
>     start=time.time()
>     cpu_ans=np.dot(a_cpu,b_cpu)
>     stop=time.time()
>     print "Cpu",stop-start
> 
> 
>     #print gpu_ans
>     #print cpu_ans
>     print np.allclose(gpu_ans,cpu_ans,atol=1e-03)
> 
> # Run main() when the script is executed directly.
> if __name__ == '__main__':
>     main()
> 
> 
> A couple of issues:
> When I increase the size of the matrix, the result seems to get less
> accurate than the CPU dot product, so I have to use np.allclose to
> compare.

It isn't necessarily clear that the CPU answer is "more accurate". The
summations performed on the GPU may occur in a different order than those
on the CPU, and since floating point addition is not associative, the
difference between the GPU and CPU results may become more pronounced for
the longer summations required when computing the product of large
matrices.
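As a quick illustration of the non-associativity (plain numpy, no GPU
involved):

import numpy as np

x = np.float32(1e8)
y = np.float32(-1e8)
z = np.float32(1.0)

# (x + y) + z gives 1.0, but x + (y + z) gives 0.0 in single precision
# because y + z rounds back to -1e8; the order in which a long sum is
# accumulated therefore changes the result.
print (x + y) + z
print x + (y + z)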

> Also, what is the optimal block size I should be using?

It depends on your matrix size; you generally want to choose the block
(and grid) dimensions to maximize occupancy, i.e., the number of threads
that can be active on the GPU at the same time. Block sizes that are a
multiple of the warp size (32 threads) are usually a good starting point.
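For example, a common pattern is to fix the block at some tile size and
scale the grid with the matrix, letting each thread compute one output
element. A rough sketch (TILE and N are hypothetical names, not from
your code):

TILE = 16  # 256 threads per block; a multiple of the 32-thread warp size

# Round the grid up so an N x N output is fully covered even when N is
# not a multiple of TILE; out-of-range threads must be guarded in the
# kernel, as in the guard added to your kernel above.
grid = ((N + TILE - 1) // TILE, (N + TILE - 1) // TILE)

mm(a_gpu, b_gpu, c_gpu,
   block=(TILE, TILE, 1),
   grid=grid)

The kernel would then compute its global row and column as
blockIdx.y * blockDim.y + threadIdx.y and
blockIdx.x * blockDim.x + threadIdx.x instead of using threadIdx alone.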

If your matrices are very small (4 x 4), it isn't clear that using the GPU will
save you much time compared to using numpy because of the cost of copying the
matrices to and from GPU memory.
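If you nonetheless want to push several thousand small matrices through
the GPU, one way to amortize the transfer cost is to stack them into a
single contiguous array and copy them over in one shot; a sketch, where
n_mats is a hypothetical count standing in for your data:

import numpy as np
from pycuda import gpuarray
import pycuda.autoinit

n_mats = 5000  # hypothetical number of 4 x 4 matrices
size = 4

# A single (n_mats, size, size) transfer crosses the bus once instead
# of n_mats separate times.
a_cpu = np.random.randn(n_mats, size, size).astype(np.float32)
a_gpu = gpuarray.to_gpu(a_cpu)

A batched kernel could then pick out its matrix using a third grid
dimension (e.g., blockIdx.z).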

Note that if you are dealing with large matrices, you may wish to check
out the CUBLAS functions for matrix multiplication; a dot() function that
wraps them is available in scikit-cuda [1], although the Python code that
makes the function easy to use may impose some noticeable overhead if you
plan to invoke it several thousand times.
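For reference, a minimal usage sketch (assuming a scikit-cuda recent
enough to be importable as skcuda):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import skcuda.linalg as linalg

linalg.init()  # must be called once before using the linalg functions

size = 4
a_gpu = gpuarray.to_gpu(np.random.randn(size, size).astype(np.float32))
b_gpu = gpuarray.to_gpu(np.random.randn(size, size).astype(np.float32))

# dot() dispatches to the appropriate CUBLAS gemm routine for the
# operands' dtype and returns the product as a new gpuarray.
c_gpu = linalg.dot(a_gpu, b_gpu)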

[1] http://scikit-cuda.rtfd.org
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/

