Ah, that works. I was thinking you were going to use some form of auto-tuned CUDA code, as you do with OpenCL. Calling the cuBLAS routines is just fine; having them 'behind the scenes' sounds good to me :)
Cheers,
Charles

On Mon, Aug 3, 2015 at 8:06 AM, Karl Rupp <r...@iue.tuwien.ac.at> wrote:

>> I am glad that I can at least understand why I am seeing this
>> difference. I absolutely think the CUDA 'port' should be added to
>> ViennaCL. It certainly may be preferable to some to call the direct
>> cuBLAS routines, but I am in favor of trying to find a balance between
>> speed and 'ease-of-use'. From my point of view, having both optimized
>> OpenCL and CUDA kernels would be a great selling point for ViennaCL.
>
> Well, we would actually call the cuBLAS routines internally, so a user
> would not get in touch with them at all. Performance *and* ease-of-use,
> so to say ;-)
>
> Best regards,
> Karli
>
>> On Mon, Aug 3, 2015 at 7:37 AM, Karl Rupp <r...@iue.tuwien.ac.at> wrote:
>>
>> Hi Charles,
>>
>>> I was benchmarking 4096x4096 matrices (again, with my R bindings). By
>>> 'slower' I mean that at this size I am observing OpenCL beating the
>>> OpenBLAS CPU implementation by over 2x, while the CUDA implementation
>>> is nearly 5x slower than the CPU. It seemed odd to me that CUDA would
>>> be so much slower than OpenCL, hence my initial thought to invite
>>> others to review my code in case I am making some sort of silly
>>> mistake. Otherwise I was intending to start pursuing direct cuBLAS
>>> methods, but I would very much prefer to use ViennaCL.
>>
>> Okay, in this case what Philippe said was already the full answer. Our
>> OpenCL kernels are highly GPU-specific and generate a 'good' kernel at
>> runtime. We haven't 'ported' (i.e. done a one-to-one translation from
>> OpenCL to CUDA) these kernels to the CUDA backend yet, so only a
>> fallback kernel is used for the CUDA backend. It should be possible to
>> carry these over without too much effort, but in that case it makes
>> more sense to just call the cuBLAS routines instead.
>> Adding this for ViennaCL 1.7.1 is certainly possible, if that is what
>> you would be happy with.
>>
>> Best regards,
>> Karli
>>
>>> On Sat, Aug 1, 2015 at 3:56 AM, Karl Rupp <r...@iue.tuwien.ac.at> wrote:
>>>
>>> Hi Charles,
>>>
>>> can you please quantify what you mean by 'slower'? How does 'slower'
>>> change as you increase the problem size? I would not be surprised if
>>> you see no performance gains below matrices of size 500-by-500. With
>>> the extra back-and-forth through PCI-Express you may even need
>>> matrices of at least 1000-by-1000.
>>>
>>> Best regards,
>>> Karli
>>>
>>>> On 07/31/2015 09:04 PM, Charles Determan wrote:
>>>>
>>>> Greetings,
>>>>
>>>> Brief background: I am developing a series of R packages to bring
>>>> ViennaCL to the R community. I have had success with the development
>>>> of my gpuR package (https://github.com/cdeterman/gpuR), which relies
>>>> on the OpenCL backend of ViennaCL (housed in the package RViennaCL).
>>>> I am hoping to submit to CRAN in the coming weeks, now that the
>>>> latest stable ViennaCL version has just been released.
>>>>
>>>> Naturally, I wanted a companion package for a CUDA backend. This is
>>>> now the gpuRcuda package (https://github.com/cdeterman/gpuRcuda).
>>>> This appeared to work successfully, as most of the code is the same.
>>>> However, my initial benchmarks are showing very dismal performance
>>>> with the CUDA backend.
>>>>
>>>> I was wondering if someone from this list would be willing to have a
>>>> look at my code to see why the CUDA code would be so much worse. I
>>>> had thought that, given I am working with an NVIDIA card (GeForce
>>>> GTX 970), CUDA would provide improved speed, but the benchmarks are
>>>> showing performance at least 5-fold slower than the CPU-based R
>>>> multiplication.
>>>> Even the 'float' type matrix multiplication is slower than R (which
>>>> only has double type support!).
>>>>
>>>> The sgemm CUDA file is
>>>> https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_sgemm.cu
>>>> and the associated C++ file is
>>>> https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_cudaMatrix_gemm.cpp
>>>>
>>>> One other note: I have tried making the two packages completely
>>>> independent, and the performance is still very poor with CUDA.
>>>>
>>>> I really appreciate any help others could provide troubleshooting
>>>> this. I have truly run out of ideas as to why the code has such poor
>>>> performance.
>>>>
>>>> Regards,
>>>> Charles
------------------------------------------------------------------------------
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel