> I am glad that I can at least understand why I am seeing this
> difference. I absolutely think the CUDA 'port' should be added to
> ViennaCL. It certainly may be preferable to some to call the direct
> cuBLAS routines but I am in favor of trying to find a balance between
> speed and 'ease-of-use'. From my point of view, having both optimized
> OpenCL and CUDA kernels would be a great selling point for ViennaCL.

Well, we would actually call the cuBLAS routines internally, so a user
would not come into contact with them at all. Performance *and*
ease-of-use, so to say ;-)

Best regards,
Karli

> On Mon, Aug 3, 2015 at 7:37 AM, Karl Rupp <[email protected]> wrote:
>
> > Hi Charles,
> >
> > > I was benchmarking 4096x4096 matrices (again, with my R bindings).
> > > By 'slower' I mean that I am observing OpenCL at this size beating
> > > the OpenBLAS CPU implementation by over 2X, but the CUDA
> > > implementation is nearly 5X slower than the CPU. It seemed odd to
> > > me that CUDA would be so much slower than OpenCL, hence my
> > > initial thought to invite others to review my code in case I am
> > > making some sort of silly mistake. Otherwise I was intending to
> > > begin pursuing direct cuBLAS methods, but I would very much prefer
> > > to use ViennaCL.
> >
> > Okay, in this case what Philippe wrote was just the full answer. Our
> > OpenCL kernels are highly GPU-specific and generate a 'good' kernel
> > at runtime. We haven't 'ported' (i.e. a one-to-one translation from
> > OpenCL to CUDA) these kernels to the CUDA backend yet, so only a
> > fallback kernel is used for the CUDA backend. It should be possible
> > to carry these over with not too much effort, but in such a case it
> > makes more sense to just call the cuBLAS routines instead. Adding
> > this for ViennaCL 1.7.1 is certainly possible if that is what you
> > would be happy with.
> >
> > Best regards,
> > Karli
> >
> > > On Sat, Aug 1, 2015 at 3:56 AM, Karl Rupp <[email protected]> wrote:
> > >
> > > > Hi Charles,
> > > >
> > > > can you please quantify what you mean by 'slower'? How does
> > > > 'slower' change as you increase the problem size? I would not be
> > > > surprised if you see no performance gains below matrices of size
> > > > 500-by-500. With the extra back-and-forth through PCI-Express you
> > > > may even need matrices of at least 1000-by-1000.
> > > >
> > > > Best regards,
> > > > Karli
> > > >
> > > > On 07/31/2015 09:04 PM, Charles Determan wrote:
> > > >
> > > > > Greetings,
> > > > >
> > > > > Brief background: I am developing a series of R packages to
> > > > > bring ViennaCL to the R community. I have had success with the
> > > > > development of my gpuR package
> > > > > (https://github.com/cdeterman/gpuR), which relies on the OpenCL
> > > > > backend of ViennaCL (which is housed in the package RViennaCL).
> > > > > I am hoping to submit to CRAN in the coming weeks now that the
> > > > > latest stable ViennaCL version has just been released.
> > > > >
> > > > > Naturally, I wanted a companion package for a CUDA backend.
> > > > > This is now the gpuRcuda package
> > > > > (https://github.com/cdeterman/gpuRcuda). This appears to work
> > > > > successfully, as most of the code is the same. However, my
> > > > > initial benchmarks are showing very dismal performance with the
> > > > > CUDA backend.
> > > > >
> > > > > I was wondering if someone from this list would be willing to
> > > > > have a look at my code to see why the CUDA code would be so
> > > > > much worse. I had thought, given a working NVIDIA card (GeForce
> > > > > GTX 970), CUDA would provide improved speed, but the benchmarks
> > > > > are showing performance at least 5-fold slower than the
> > > > > CPU-based R multiplication. Even the 'float' type matrix
> > > > > multiplication is slower than R (which only has double type
> > > > > support!).
> > > > >
> > > > > The sgemm CUDA file is
> > > > > (https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_sgemm.cu)
> > > > > and the associated C++ file is
> > > > > (https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_cudaMatrix_gemm.cpp).
> > > > >
> > > > > One other note: I have tried making the two packages completely
> > > > > independent and the performance is still very poor with CUDA.
> > > > >
> > > > > I really appreciate any help others could provide
> > > > > troubleshooting this. I have truly run out of ideas as to why
> > > > > the code has such poor performance.
> > > > >
> > > > > Regards,
> > > > > Charles

_______________________________________________
ViennaCL-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
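
Karl's remark about problem size and the PCI-Express round trip can be made concrete with a back-of-the-envelope count. The sketch below is not from the thread. It assumes a square n-by-n product in which both operands are copied to the GPU and the result is copied back on every call, and the function names are made up for illustration. It only counts arithmetic per transferred byte and makes no claim about any particular card or bus.

```python
# Rough operation count for C = A * B with square n-by-n matrices.
# Assumption (not from the thread): A and B are copied to the device
# and C is copied back to the host on every multiplication.

def gemm_flops(n: int) -> int:
    """Floating-point operations: one multiply and one add per inner-product term."""
    return 2 * n ** 3


def transfer_bytes(n: int, bytes_per_element: int = 4) -> int:
    """Bytes moved over the bus: three n-by-n matrices (A, B in; C out)."""
    return 3 * n * n * bytes_per_element


def flops_per_byte(n: int, bytes_per_element: int = 4) -> float:
    """Arithmetic available to amortize each byte sent across the bus."""
    return gemm_flops(n) / transfer_bytes(n, bytes_per_element)


if __name__ == "__main__":
    # Sizes mentioned in the thread: 500, 1000 and 4096.
    for n in (500, 1000, 4096):
        single = flops_per_byte(n, 4)  # 'float' elements
        double = flops_per_byte(n, 8)  # 'double' elements
        print(f"n={n:5d}  float: {single:7.1f} flop/byte  double: {double:7.1f} flop/byte")
```

Under these assumptions the ratio is n/6 for float and n/12 for double, so it grows linearly with the matrix size: each transferred byte is amortized over more arithmetic as n increases. The count says nothing about kernel quality, which is what Karl identifies as the issue for the 4096x4096 CUDA case.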
