Hi Lev,
I started using NVBLAS, NVIDIA's drop-in GPU-accelerated BLAS library.
So far it seems OK (I think), but I don't see all of my GPUs being
utilized when I do a numpy.dot(A, B). I will experiment with it some more
to get a better idea of what's going on; I want to avoid writing my own
matrix multiplication routine.
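
In case it's useful, here's roughly how I've been timing things (just a
sketch; it assumes numpy was launched with NVBLAS preloaded, e.g.
LD_PRELOAD=libnvblas.so plus an nvblas.conf, per NVIDIA's NVBLAS docs --
the exact library path will vary by install):

    import time
    import numpy as np

    # numpy.dot only dispatches to BLAS *gemm (which NVBLAS intercepts)
    # for contiguous 2-D float32/float64 arrays, so build inputs that way.
    n = 8192
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)

    np.dot(A, B)  # warm-up: GPU initialization and host/device transfers
    start = time.time()
    C = np.dot(A, B)
    print('%d x %d sgemm: %.3f s' % (n, n, time.time() - start))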



On Mon, Nov 9, 2015 at 12:26 AM, Lev Givon <l...@columbia.edu> wrote:
> Received from Keith Brown on Sun, Nov 08, 2015 at 11:46:47PM EST:
>> Thanks Lev.
>> My matrix size is going to be large, somewhere near n=100000.
>
> (I assume n = the total number of elements in the matrix; a matrix of
> 10**5 x 10**5 32-bit floating point values would occupy 10**10 * 4 bytes,
> i.e. roughly 40 GB, which is more memory than currently available GPUs
> can provide.)
>
>> So, how can I compare CPU and GPU matrix math? I thought my
>> technique was good enough, but apparently not.
>
> If you are trying to ensure that the CPU and GPU are doing as similar
> floating point computations as possible, you may want to look into whether
> the intrinsic single precision functions that CUDA provides to enable
> control of rounding during addition and multiplication (e.g., __fadd_rd,
> __fadd_rn, etc.) may be useful, as well as compiler options that affect
> processing of denormals (e.g., --ftz). For the purposes of checking
> algorithmic correctness against an existing (CPU-based) implementation,
> you may want to use double precision (even if you plan to use single
> precision for your actual computations). In general, though, it is prudent
> to test results (via allclose()) with some defined tolerance in light of
> the effects of floating point operations.
> --
> Lev Givon
> Bionet Group | Neurokernel Project
> http://lebedov.github.io/
> http://neurokernel.github.io/
>
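
Following up on the double precision suggestion above: below is the kind
of check I have in mind, with scikit-cuda's cuBLAS wrapper standing in
for the GPU side (rough sketch; the tolerances are guesses and will
probably need tuning):

    import numpy as np
    import pycuda.autoinit  # initializes a CUDA context on import
    import pycuda.gpuarray as gpuarray
    import skcuda.linalg as linalg

    linalg.init()

    n = 1024
    A = np.random.rand(n, n)  # float64 by default
    B = np.random.rand(n, n)

    # Reference product on the CPU.
    C_cpu = np.dot(A, B)

    # Same product on the GPU via cuBLAS (needs a device with reasonable
    # double precision support).
    C_gpu = linalg.dot(gpuarray.to_gpu(A), gpuarray.to_gpu(B)).get()

    # Compare with explicit tolerances rather than exact equality.
    print(np.allclose(C_cpu, C_gpu, rtol=1e-10, atol=1e-12))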
