Hello everyone,

We're actively working on adding native CUDA support to Apache Mahout. 
Currently, GPU acceleration is enabled through ViennaCL 
(http://viennacl.sourceforge.net/). ViennaCL is a linear algebra framework that 
provides multiple backends, including OpenMP, OpenCL, and CUDA. However, as we 
recently discovered, the CUDA backend in ViennaCL consists of hand-written 
CUDA kernels that are not well tuned for the latest GPU architectures. 
We therefore decided to explore leveraging the CUDA linear algebra libraries 
directly: cuBLAS (dense matrices), cuSPARSE (sparse matrices) and cuSOLVER 
(dense factorizations and sparse solvers). These libraries are highly tuned by 
NVIDIA and provide the best performance for many linear algebra primitives on 
NVIDIA GPU architectures. Moreover, the libraries receive frequent updates 
with each new CUDA toolkit release: bug fixes, new functionality and 
optimizations.

We considered two approaches:

  1.  Direct calls to the CUDA runtime and libraries through a JavaCPP bridge
  2.  Use of the JCuda package (http://www.jcuda.org/)

JCuda is a thin Java layer on top of the CUDA runtime and already provides Java 
wrappers for all available CUDA libraries, so it makes sense to choose this 
path. JCuda also provides a mechanism for calling custom CUDA kernels: they are 
compiled into PTX with the NVIDIA NVCC compiler and then loaded through CUDA 
Driver API calls in Java code (a sketch of this follows the first example 
below). Here is example code that allocates a device pointer (cudaMalloc) and 
copies data to the GPU (cudaMemcpy) using JCuda:

// Required JCuda imports for the snippets below
import static jcuda.runtime.JCuda.*;
import jcuda.Pointer;
import jcuda.runtime.cudaMemcpyKind;

// Allocate memory on the device
Pointer deviceData = new Pointer();
cudaMalloc(deviceData, memorySize);

// Copy the host data to the device
cudaMemcpy(deviceData, Pointer.to(hostData), memorySize,
           cudaMemcpyKind.cudaMemcpyHostToDevice);
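
As for the custom-kernel mechanism mentioned above, here is a rough sketch of 
loading a PTX module and launching a kernel through the JCuda driver API. The 
file name kernel.ptx, the kernel name scaleKernel, and its parameters are 
hypothetical, for illustration only:

import static jcuda.driver.JCudaDriver.*;
import jcuda.Pointer;
import jcuda.driver.*;

// Initialize the driver API and create a context on device 0
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);

// Load the PTX produced by nvcc and look up the kernel by name
CUmodule module = new CUmodule();
cuModuleLoad(module, "kernel.ptx");
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "scaleKernel");

// Kernel arguments are passed as an array of pointers to the arguments
Pointer kernelParams = Pointer.to(
    Pointer.to(deviceData),
    Pointer.to(new int[]{ n })
);
cuLaunchKernel(function,
    gridSizeX, 1, 1,    // grid dimensions
    blockSizeX, 1, 1,   // block dimensions
    0, null,            // shared memory bytes, stream
    kernelParams, null  // kernel arguments, extra options
);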

Instead of cudaMalloc, a pointer can also be allocated with cudaMallocManaged; 
it can then be accessed from both the CPU and the GPU without explicit copies 
by leveraging Unified Memory. This enables a simpler data management model 
and, on newer architectures, features like on-demand paging and transparent 
oversubscription of GPU memory.
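
A minimal sketch, assuming a JCuda build recent enough to expose 
cudaMallocManaged and the cudaMemAttachGlobal flag:

import static jcuda.runtime.JCuda.*;
import java.nio.ByteBuffer;
import jcuda.Pointer;

// Allocate managed (unified) memory, visible to both host and device
Pointer managedData = new Pointer();
cudaMallocManaged(managedData, memorySize, cudaMemAttachGlobal);

// ... run kernels or library calls on managedData ...

// Synchronize, then read the same allocation directly on the host
cudaDeviceSynchronize();
ByteBuffer hostView = managedData.getByteBuffer(0, memorySize);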

All CUDA libraries operate directly on GPU (device) pointers. Here is an 
example of calling a single-precision GEMM (cublasSgemm) with JCuda:

// cuBLAS imports; handle, pAlpha and pBeta are assumed to be set up
// beforehand (cublasCreate, pointers to the scalar factors)
import static jcuda.jcublas.JCublas2.*;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_N;
import jcuda.Pointer;
import jcuda.Sizeof;

// Allocate memory on the device
Pointer d_A = new Pointer();
Pointer d_B = new Pointer();
Pointer d_C = new Pointer();
cudaMalloc(d_A, (long) n * n * Sizeof.FLOAT);
cudaMalloc(d_B, (long) n * n * Sizeof.FLOAT);
cudaMalloc(d_C, (long) n * n * Sizeof.FLOAT);

// Copy the host matrices A and B to the device (as shown above)

// Execute sgemm: C = alpha * A * B + beta * C
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
            pAlpha, d_A, n, d_B, n, pBeta, d_C, n);
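
After the call completes, the result can be copied back to the host with an 
ordinary device-to-host copy:

// Copy the result matrix C back to the host
float[] hostC = new float[n * n];
cudaMemcpy(Pointer.to(hostC), d_C, (long) n * n * Sizeof.FLOAT,
           cudaMemcpyKind.cudaMemcpyDeviceToHost);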

Most of the existing sparse matrix classes and sparse matrix conversion 
routines in Mahout can generally keep their structure, since the CSR format is 
well supported in both the cuSPARSE and cuSOLVER libraries.
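
For reference, here is what a small matrix looks like in CSR form, the layout 
that cuSPARSE and cuSOLVER routines consume (the arrays below are illustrative, 
not Mahout code):

// 3x3 sparse matrix:
// [ 1 0 2 ]
// [ 0 3 0 ]
// [ 4 0 5 ]
float[] csrVal    = { 1f, 2f, 3f, 4f, 5f }; // nonzero values, row by row
int[]   csrRowPtr = { 0, 2, 3, 5 };         // start of each row (last = nnz)
int[]   csrColInd = { 0, 2, 1, 0, 2 };      // column index of each nonzero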

Our plan is to first create a proof-of-concept implementation that 
demonstrates matrix-matrix and/or matrix-vector multiplication using the CUDA 
libraries, and then to expand the functionality with more BLAS operations and 
the more advanced algorithms available in cuSOLVER. Stay tuned for more 
updates!

Regards,
Nikolai.


