Ah, that works. I was thinking you were going to use some form of auto-tuned CUDA code, as you do with OpenCL. Calling the cuBLAS routines is just fine; having them 'behind the scenes' sounds good to me :)
Cheers,
Charles

On Mon, Aug 3, 2015 at 8:06 AM, Karl Rupp <r...@iue.tuwien.ac.at> wrote:

>> I am glad that I can at least understand why I am seeing this
>> difference. I absolutely think the CUDA 'port' should be added to
>> ViennaCL. It certainly may be preferable to some to call the direct
>> cuBLAS routines, but I am in favor of trying to find a balance between
>> speed and 'ease-of-use'. From my point of view, having both optimized
>> OpenCL and CUDA kernels would be a great selling point for ViennaCL.
>
> Well, we would actually call the cuBLAS routines internally, so a user
> would not get in touch with them at all. Performance *and* ease-of-use,
> so to say ;-)
>
> Best regards,
> Karli
>
>> On Mon, Aug 3, 2015 at 7:37 AM, Karl Rupp <r...@iue.tuwien.ac.at> wrote:
>>
>> Hi Charles,
>>
>>> I was benchmarking 4096x4096 matrices (again, with my R bindings). By
>>> 'slower' I mean that at this size I am observing OpenCL beating the
>>> OpenBLAS CPU implementation by over 2x, while the CUDA implementation
>>> is nearly 5x slower than the CPU. It seemed odd to me that CUDA would
>>> be so much slower than OpenCL, hence my initial thought to invite
>>> others to review my code in case I am making some sort of silly
>>> mistake. Otherwise I was intending to start pursuing direct cuBLAS
>>> methods, but I would very much prefer to use ViennaCL.
>>
>> Okay, in this case what Philippe said was already the full answer. Our
>> OpenCL kernels are highly GPU-specific and generate a 'good' kernel at
>> runtime. We haven't 'ported' (i.e. done a one-to-one translation from
>> OpenCL to CUDA) these kernels to the CUDA backend yet, so only a
>> fallback kernel is used for the CUDA backend. It should be possible to
>> carry these over without too much effort, but in that case it makes
>> more sense to just call the cuBLAS routines instead.
>> Adding this for ViennaCL 1.7.1 is certainly possible, if that is what
>> you would be happy with.
>>
>> Best regards,
>> Karli
>>
>>> On Sat, Aug 1, 2015 at 3:56 AM, Karl Rupp <r...@iue.tuwien.ac.at> wrote:
>>>
>>> Hi Charles,
>>>
>>> can you please quantify what you mean by 'slower'? How does 'slower'
>>> change as you increase the problem size? I would not be surprised if
>>> you see no performance gains below matrices of size 500-by-500. With
>>> the extra back-and-forth through PCI-Express you may even need
>>> matrices of at least 1000-by-1000.
>>>
>>> Best regards,
>>> Karli
>>>
>>>> On 07/31/2015 09:04 PM, Charles Determan wrote:
>>>>
>>>> Greetings,
>>>>
>>>> Brief background: I am developing a series of R packages to bring
>>>> ViennaCL to the R community. I have had success with the development
>>>> of my gpuR package (https://github.com/cdeterman/gpuR), which relies
>>>> on the OpenCL backend of ViennaCL (housed in the package RViennaCL).
>>>> I am hoping to submit to CRAN in the coming weeks, now that the
>>>> latest stable ViennaCL version has just been released.
>>>>
>>>> Naturally, I wanted a companion package for a CUDA backend. This is
>>>> now the gpuRcuda package (https://github.com/cdeterman/gpuRcuda).
>>>> This appeared to work successfully, as most of the code is the same.
>>>> However, my initial benchmarks are showing very dismal performance
>>>> with the CUDA backend.
>>>>
>>>> I was wondering if someone from this list would be willing to have a
>>>> look at my code to see why the CUDA code would be so much worse. I
>>>> had thought that, given I am working with an NVIDIA card (GeForce
>>>> GTX 970), CUDA would provide improved speed, but the benchmarks are
>>>> showing performance at least 5-fold slower than the CPU-based R
>>>> multiplication.
>>>> Even the 'float' type matrix multiplication is slower than R (which
>>>> only has double type support!).
>>>>
>>>> The sgemm CUDA file is
>>>> https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_sgemm.cu
>>>> and the associated C++ file is
>>>> https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_cudaMatrix_gemm.cpp
>>>>
>>>> One other note: I have tried making the two packages completely
>>>> independent, and the performance is still very poor with CUDA.
>>>>
>>>> I really appreciate any help others could provide troubleshooting
>>>> this. I have truly run out of ideas as to why the code has such poor
>>>> performance.
>>>>
>>>> Regards,
>>>> Charles
------------------------------------------------------------------------------
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel