Greetings,

Brief background: I am developing a series of R packages to bring ViennaCL to the R community. I have had success with my gpuR package (https://github.com/cdeterman/gpuR), which relies on the OpenCL backend of ViennaCL (housed in the RViennaCL package). I am hoping to submit it to CRAN in the coming weeks, now that the latest stable ViennaCL version has been released.
Naturally, I wanted a companion package with a CUDA backend; this is now the gpuRcuda package (https://github.com/cdeterman/gpuRcuda). It appears to work correctly, as most of the code is the same, but my initial benchmarks show very dismal performance with the CUDA backend. I was wondering whether someone on this list would be willing to look at my code and see why the CUDA path is so much slower. I had expected that, with an NVIDIA card (GeForce GTX 970), CUDA would provide improved speed, yet the benchmarks show matrix multiplication at least 5-fold slower than CPU-based R. Even the 'float' type multiplication is slower than R (which only supports the double type!).

The sgemm CUDA file is https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_sgemm.cu and the associated C++ file is https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_cudaMatrix_gemm.cpp .

On another note, I have tried making the two packages completely independent of each other, and the performance with CUDA is still very poor.

I would really appreciate any help troubleshooting this; I have truly run out of ideas as to why the code performs so poorly.
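For concreteness, here is a minimal standalone sketch of the kind of timing I have in mind, written directly against ViennaCL's CUDA backend and independent of R/Rcpp. The matrix size, file name, and build line are placeholders rather than anything taken from the gpuRcuda sources:

// vcl_sgemm_bench.cu -- standalone ViennaCL CUDA sgemm timing sketch
// Assumed build line: nvcc -O3 -std=c++11 -I/path/to/ViennaCL vcl_sgemm_bench.cu -o vcl_sgemm_bench
#define VIENNACL_WITH_CUDA

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

#include "viennacl/matrix.hpp"
#include "viennacl/linalg/prod.hpp"
#include "viennacl/backend/memory.hpp"

int main()
{
  std::size_t const N = 2048;   // matrix dimension (placeholder; match the R benchmark)

  // Host matrices as nested std::vectors, which viennacl::copy() accepts directly
  std::vector<std::vector<float> > host_A(N, std::vector<float>(N));
  std::vector<std::vector<float> > host_B(N, std::vector<float>(N));
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < N; ++j)
    {
      host_A[i][j] = static_cast<float>(std::rand()) / RAND_MAX;
      host_B[i][j] = static_cast<float>(std::rand()) / RAND_MAX;
    }

  // Device matrices on the CUDA backend
  viennacl::matrix<float> A(N, N), B(N, N), C(N, N);
  viennacl::copy(host_A, A);
  viennacl::copy(host_B, B);

  // Warm-up multiplication so one-time setup costs are not included in the timing
  C = viennacl::linalg::prod(A, B);
  viennacl::backend::finish();

  // Timed multiplication; finish() waits for the asynchronous CUDA kernel to complete
  auto t0 = std::chrono::steady_clock::now();
  C = viennacl::linalg::prod(A, B);
  viennacl::backend::finish();
  auto t1 = std::chrono::steady_clock::now();

  double const seconds = std::chrono::duration<double>(t1 - t0).count();
  double const gflops  = 2.0 * N * N * N / seconds / 1e9;   // dense gemm does ~2*N^3 flops
  std::cout << "sgemm " << N << "x" << N << ": " << seconds << " s, "
            << gflops << " GFLOP/s" << std::endl;

  return 0;
}

If a standalone build along these lines reaches the throughput one would expect from a GTX 970, the slowdown is presumably coming from the R/Rcpp layer rather than from ViennaCL's CUDA gemm itself.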
Regards,
Charles