> I am glad that I can at least understand why I am seeing this
> difference.  I absolutely think the CUDA 'port' should be added to
> ViennaCL.  It certainly may be preferable to some to call the direct
> cuBLAS routines but I am in favor of trying to find a balance between
> speed and 'ease-of-use'.  From my point of view, having both optimized
> OpenCL and CUDA kernels would be a great selling point for ViennaCL.

Well, we would actually call the cuBLAS routines internally, so a user 
would never need to touch cuBLAS directly. Performance *and* ease of 
use, so to speak ;-)
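For concreteness, here is a minimal sketch of how a sgemm call could be dispatched to cuBLAS internally. This is hypothetical illustration code, not ViennaCL's actual implementation: the guard macro `VIENNACL_WITH_CUDA` and the helper name `sgemm` are assumptions, the cuBLAS path expects device pointers, and the fallback is a naive host loop so the file also builds without the CUDA toolkit.

```cpp
#include <vector>
#ifdef VIENNACL_WITH_CUDA
#include <cublas_v2.h>
#endif

// C = A * B for column-major m x k (A) and k x n (B) single-precision
// matrices. Hypothetical dispatch sketch, not ViennaCL's actual code.
void sgemm(int m, int n, int k,
           const float* A, const float* B, float* C) {
#ifdef VIENNACL_WITH_CUDA
    // With CUDA available, delegate to cuBLAS (A, B, C are assumed to
    // already be device pointers in cuBLAS's column-major convention).
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, A, m, B, k, &beta, C, m);
    cublasDestroy(handle);
#else
    // Fallback: naive triple loop on the host, same column-major layout.
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float s = 0.0f;
            for (int p = 0; p < k; ++p)
                s += A[i + p * m] * B[p + j * k];
            C[i + j * m] = s;
        }
#endif
}
```

The point of routing through one entry point is exactly the ease-of-use argument above: the user-facing API stays identical whichever backend is compiled in.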

Best regards,
Karli



>
> On Mon, Aug 3, 2015 at 7:37 AM, Karl Rupp <[email protected]> wrote:
>
>     Hi Charles,
>
>         I was benchmarking 4096x4096 matrices (again, with my R
>         bindings). By 'slower' I mean that I am observing OpenCL at
>         this size beating the OpenBLAS CPU implementation by over 2X,
>         but the CUDA implementation is nearly 5X slower than the CPU.
>         It seemed odd to me that CUDA would be so much slower than
>         OpenCL, hence my initial thought to invite others to review my
>         code in case I am making some sort of silly mistake. Otherwise
>         I was intending to begin pursuing direct cuBLAS methods, but I
>         would much prefer to use ViennaCL.
>
>
>     okay, in that case what Philippe said is the full answer. Our
>     OpenCL kernels are highly GPU-specific and generate a 'good' kernel
>     at runtime. We haven't 'ported' these kernels (i.e. done a
>     one-to-one translation from OpenCL to CUDA) to the CUDA backend
>     yet, so only a fallback kernel is used there. It should be possible
>     to carry them over without too much effort, but in that case it
>     makes more sense to just call the cuBLAS routines instead. Adding
>     this for ViennaCL 1.7.1 is certainly possible if that is what you
>     would be happy with.
>
>     Best regards,
>     Karli
>
>
>
>         On Sat, Aug 1, 2015 at 3:56 AM, Karl Rupp <[email protected]>
>         wrote:
>
>              Hi Charles,
>
>              can you please quantify what you mean by 'slower'? How
>              does 'slower' change as you increase the problem size? I
>              would not be surprised if you see no performance gains
>              below matrices of size 500-by-500. With the extra
>              back-and-forth through PCI-Express you may even need
>              matrices of at least 1000-by-1000.
>
>              Best regards,
>              Karli
>
>
>
>              On 07/31/2015 09:04 PM, Charles Determan wrote:
>
>                  Greetings,
>
>                  Brief background: I am developing a series of R
>                  packages to bring ViennaCL to the R community. I have
>                  had success with the development of my gpuR package
>                  (https://github.com/cdeterman/gpuR), which relies on
>                  the OpenCL backend of ViennaCL (housed in the package
>                  RViennaCL). I am hoping to submit to CRAN in the
>                  coming weeks now that the latest stable ViennaCL
>                  version has just been released.
>
>                  Naturally, I wanted a companion package with a CUDA
>                  backend. This is now the gpuRcuda package
>                  (https://github.com/cdeterman/gpuRcuda). It has
>                  appeared to work successfully, as most of the code is
>                  the same. However, my initial benchmarks are showing
>                  very dismal performance with the CUDA backend.
>
>                  I was wondering if someone from this list would be
>                  willing to have a look at my code to see why the CUDA
>                  code performs so much worse. I had thought that, given
>                  a working NVIDIA card (GeForce GTX 970), CUDA would
>                  provide improved speed, but the benchmarks are showing
>                  performance at least 5-fold slower than the CPU-based
>                  R multiplication. Even the 'float' type matrix
>                  multiplication is slower than R (which only has double
>                  type support!).
>
>                  The sgemm CUDA file is
>                  (https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_sgemm.cu)
>                  and the associated C++ file is
>                  (https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_cudaMatrix_gemm.cpp).
>
>                  One other note: I have tried making the two packages
>                  completely independent, and the performance is still
>                  very poor with CUDA.
>
>                  I really appreciate any help others could provide
>                  troubleshooting this.
>                  I have truly run out of ideas as to why the code has
>         such poor
>                  performance.
>
>                  Regards,
>                  Charles
>
>
>


------------------------------------------------------------------------------
_______________________________________________
ViennaCL-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
