Karl,

I was benchmarking 4096x4096 matrices (again, through my R bindings). By 'slower' I mean that at this size the OpenCL backend beats the OpenBLAS CPU implementation by over 2x, whereas the CUDA backend is nearly 5x slower than the CPU. It seemed odd to me that CUDA would be so much slower than OpenCL, hence my initial thought to invite others to review my code in case I am making some sort of silly mistake. Otherwise I intend to start pursuing direct cuBLAS methods, but I would very much prefer to use ViennaCL.
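To make speedup figures like the ones above easier to compare across problem sizes, the timings can be reduced to a ratio and a GFLOP/s number. The following is only a minimal C++ sketch under stated assumptions, not the gpuRcuda code: the helper names (`gemm_flops`, `gemm_gflops`, `speedup`, `time_seconds`, `naive_gemm`) are hypothetical, and the naive CPU loop merely stands in for whichever backend is being measured. For a GPU backend, the timed callable would also have to block until the device has finished, since kernel launches are typically asynchronous.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Floating-point operations in an n-by-n GEMM: n^3 multiplies plus n^3 adds.
double gemm_flops(std::size_t n) {
    const double d = static_cast<double>(n);
    return 2.0 * d * d * d;
}

// Convert a wall-clock time (seconds) into GFLOP/s for an n-by-n GEMM.
double gemm_gflops(std::size_t n, double seconds) {
    assert(seconds > 0.0);
    return gemm_flops(n) / seconds / 1e9;
}

// Ratio > 1 means the candidate is faster than the baseline.
double speedup(double baseline_seconds, double candidate_seconds) {
    assert(candidate_seconds > 0.0);
    return baseline_seconds / candidate_seconds;
}

// Time a callable: one untimed warm-up run (so one-off setup cost is not
// counted), then the best of `reps` timed runs.
template <typename F>
double time_seconds(F&& run, int reps = 3) {
    run();  // warm-up, not timed
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < reps; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        run();
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// Naive row-major C = A * B; a hypothetical stand-in for the measured backend.
void naive_gemm(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t n) {
    C.assign(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

For n = 4096 a single GEMM amounts to 2 * 4096^3, roughly 137.4 billion floating-point operations, so dividing that count by each backend's measured time gives numbers that can be compared directly as the matrix size changes.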
Regards,
Charles

On Sat, Aug 1, 2015 at 3:56 AM, Karl Rupp <r...@iue.tuwien.ac.at> wrote:

> Hi Charles,
>
> can you please quantify what you mean by 'slower'? How does 'slower'
> change as you increase the problem size? I would not be surprised if you
> see no performance gains below matrices of size 500-by-500. With the extra
> back-and-forth through PCI-Express you may even need matrices of at least
> 1000-by-1000.
>
> Best regards,
> Karli
>
> On 07/31/2015 09:04 PM, Charles Determan wrote:
>
>> Greetings,
>>
>> Brief background, I am developing a series of R packages to bring
>> ViennaCL to the R community. I have had success with the development of
>> my gpuR package (https://github.com/cdeterman/gpuR) which relies on the
>> OpenCL backend of ViennaCL (which is housed in the package RViennaCL).
>> I am hoping to submit to CRAN in the coming weeks now that the latest
>> stable ViennaCL version has just been released.
>>
>> Naturally, I wanted a companion package for a CUDA backend. This is now
>> the gpuRcuda package (https://github.com/cdeterman/gpuRcuda). This has
>> appeared to work successfully as most of the code is the same. However,
>> my initial benchmarks are showing very dismal performance with the CUDA
>> backend.
>>
>> I was wondering if someone from this list would be willing to have a
>> look at my code to see why the CUDA code would be so much worse. I had
>> thought, given working a NVIDIA card (GeForce GTX 970), CUDA would
>> provide improved speed but the benchmarks are showing performance at
>> least 5-fold slower than the CPU based R multiplication. Even the
>> 'float' type matrix multiplication is slower than R (which only has
>> double type support!).
>>
>> The sgemm CUDA file is
>> (https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_sgemm.cu) and
>> the associated C++ file is
>> (
>> https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_cudaMatrix_gemm.cpp
>> ).
>>
>> Other note, I have tried making the two packages completely independent
>> and the performance is still very poor with CUDA.
>>
>> I really appreciate any help others could provide troubleshooting this.
>> I have truly run out of ideas as to why the code has such poor
>> performance.
>>
>> Regards,
>> Charles
>>
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> ViennaCL-devel mailing list
>> ViennaCL-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/viennacl-devel