Hey all,
I started a half-formed thought on Slack the other day and got pulled away before I
could finish it.  This is just a quick write-up, which still needs more citations and
empirical evidence to back up the reasoning, along with some examples.

I was looking at some FPGA work and had a thought.  We have two ViennaCL modules in
the experimental section of the project, based on the ViennaCL abstract BLAS
library [1][2]: mahout-viennacl-omp_2.x, which parallelizes routines across
shared-memory CPU cores via OpenMP [3], and mahout-viennacl_2.x, which offloads
routines to the GPU using OpenCL [4].

We've had good luck running the OpenMP module against off-heap main memory, showing
significant speedups in Level 2 and Level 3 BLAS operations on both sparse and dense
matrix algebra.  That translates into speedups on algorithms like DSSVD and DSPCA [?],
which are iterative and heavily dependent on matrix-matrix multiplication.  With some
clean-up of the module itself [5] and of the auto-probing logic [6], this module could
be tuned, cleaned, and made ready for testing, to allow it back into the production
environment.  There is still some work to be done on memory management here, but it is
straightforward, i.e. ensuring that minimal memory copies are made when an evaluation
is triggered [7].
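
As a concrete illustration (not the Mahout bindings themselves, just a minimal sketch
against the plain ViennaCL C++ API, with illustrative sizes and values), a Level 3
operation on the OpenMP backend looks roughly like:

    // Minimal ViennaCL sketch of a Level 3 (matrix-matrix) product with the
    // OpenMP backend enabled.  Sizes and values are illustrative only.
    #define VIENNACL_WITH_OPENMP
    #include "viennacl/matrix.hpp"
    #include "viennacl/linalg/prod.hpp"
    #include <vector>

    int main() {
      const std::size_t n = 1024;
      std::vector< std::vector<double> > hostA(n, std::vector<double>(n, 1.0));
      std::vector< std::vector<double> > hostB(n, std::vector<double>(n, 2.0));

      viennacl::matrix<double> A(n, n), B(n, n), C(n, n);
      viennacl::copy(hostA, A);   // host -> ViennaCL storage (stays in main memory here)
      viennacl::copy(hostB, B);

      // C = A * B; with VIENNACL_WITH_OPENMP this is parallelized across CPU cores.
      C = viennacl::linalg::prod(A, B);
      return 0;
    }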

However, the ViennaCL OpenCL module is still very much in an experimental phase.
While significant speedups have been shown in both sparse and dense matrix algebra
[8], up to 30x on trivial matrix-matrix problems where the matrix blocks per node fit
into a node's GPU memory, no memory tiling or scatter/gather routines are implemented
for multiple GPUs on a node [9] at the time of this writing, and memory management is
naive, with a copy from off-heap to GPU memory at each operation.  A memory management
scheme that synchronizes operands and solutions to main memory has been proposed [9],
but not implemented; therefore the iterative evaluation of e.g.

     drmA := ...
     drmB := ...
     for i <- 1..10 do
         drmC := mxRAND(drmB.ncol, 100)
         drmA <- drmB * drmC
     done

would incur 10 copies per block of both drmA's and drmB's backing blocks to and from
the GPU.
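
To illustrate the cost, here is a rough ViennaCL C++ sketch (hypothetical function
names; random fills, host-side sizing, and error handling omitted) contrasting today's
copy-per-operation pattern with the proposed device-resident pattern [9]:

    #define VIENNACL_WITH_OPENCL
    #include "viennacl/matrix.hpp"
    #include "viennacl/linalg/prod.hpp"
    #include <vector>

    typedef std::vector< std::vector<double> > HostMatrix;

    // Today: operands are pushed from off-heap/host memory to the GPU on every
    // operation, and the result is pulled straight back.
    void naiveLoop(const HostMatrix& hostB, HostMatrix& hostA, std::size_t n) {
      for (int i = 0; i < 10; ++i) {
        viennacl::matrix<double> B(n, n), C(n, 100), A(n, 100);
        viennacl::copy(hostB, B);             // host -> GPU, every iteration
        // ... fill C with random values ...
        A = viennacl::linalg::prod(B, C);     // compute on the GPU
        viennacl::copy(A, hostA);             // GPU -> host, every iteration
      }
    }

    // Proposed [9]: keep operands resident on the device and synchronize the
    // result back to main memory once, after the loop.
    void residentLoop(const HostMatrix& hostB, HostMatrix& hostA, std::size_t n) {
      viennacl::matrix<double> B(n, n), A(n, 100);
      viennacl::copy(hostB, B);               // single host -> GPU copy
      for (int i = 0; i < 10; ++i) {
        viennacl::matrix<double> C(n, 100);
        // ... fill C with random values ...
        A = viennacl::linalg::prod(B, C);     // operands and result stay on the GPU
      }
      viennacl::copy(A, hostA);               // single GPU -> host copy
    }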

Considering that a tiling strategy is not yet implemented for multiple GPUs, there is
significant work to do to bring this module up to the state of the art for a Spark
backend.

Furthermore, work on a CUDA backend has already begun in a branch,
apache/mahout:[CUDA], which makes use of the JCuda library to interface directly with
cuBLAS, cuSPARSE, and other CUDA libraries.  These have seen much work on memory
management, shared memory, and optimized routines in the years since the ViennaCL
library was written.
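
For context, the native call ultimately being wrapped through JCuda is a cuBLAS GEMM;
a rough native-side sketch (illustrative sizes, no error handling, and not the actual
branch code) would be:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Hypothetical sketch of the cuBLAS GEMM that the JCuda-based branch exposes
    // to the JVM.  All names and sizes here are illustrative only.
    void dgemm_on_gpu(const double* hA, const double* hB, double* hC, int n) {
      double *dA, *dB, *dC;
      size_t bytes = sizeof(double) * n * n;
      cudaMalloc(&dA, bytes);  cudaMalloc(&dB, bytes);  cudaMalloc(&dC, bytes);
      cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

      cublasHandle_t handle;
      cublasCreate(&handle);
      const double alpha = 1.0, beta = 0.0;
      // C = alpha * A * B + beta * C  (column-major, as BLAS expects)
      cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                  &alpha, dA, n, dB, n, &beta, dC, n);

      cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
      cublasDestroy(handle);
      cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }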

It was decided in [xx] that further GPU work would target the NVIDIA libraries, and
the ViennaCL modules have been moved into the experimental sub-project.

This leaves us with a partly finished ViennaCL module, which could still be useful for
AMD GPUs; moreover, since OpenCL is itself an abstraction layer, it leaves open the
opportunity to begin experimental work on other hardware, e.g. FPGAs [9][10].

Proposal:
Finish out the basic GPU BLAS routines in the apache/mahout:[CUDA] branch, and
repurpose the ViennaCL work for pipelines on FPGAs [10][11].  fBLAS demonstrates
offloading matrix-math OpenCL kernels to FPGA backends [12].  While fBLAS is only one
example, the Khronos Group has enlisted a large group of hardware and software vendors
to adopt the OpenCL standard [13].

For an example of the Khronos Group's vector-add HPP OpenCL kernel, see [14].
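
For readers who don't want to click through, a minimal sketch along the same lines
(using the cl2.hpp C++ wrapper; the kernel and buffer sizes are illustrative) is:

    #define CL_HPP_TARGET_OPENCL_VERSION 200
    #include <CL/cl2.hpp>
    #include <string>
    #include <vector>

    int main() {
      // OpenCL C kernel, compiled at runtime for whatever device is present.
      std::string src =
          "kernel void vadd(global const float* a, global const float* b, global float* c) {"
          "  size_t i = get_global_id(0);"
          "  c[i] = a[i] + b[i];"
          "}";

      cl::Context context(CL_DEVICE_TYPE_DEFAULT);
      cl::Program program(context, src, /*build=*/true);
      cl::CommandQueue queue(context);

      const size_t n = 1024;
      std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
      cl::Buffer dA(context, a.begin(), a.end(), /*readOnly=*/true);
      cl::Buffer dB(context, b.begin(), b.end(), /*readOnly=*/true);
      cl::Buffer dC(context, CL_MEM_WRITE_ONLY, sizeof(float) * n);

      cl::KernelFunctor<cl::Buffer, cl::Buffer, cl::Buffer> vadd(program, "vadd");
      vadd(cl::EnqueueArgs(queue, cl::NDRange(n)), dA, dB, dC);
      cl::copy(queue, dC, c.begin(), c.end());   // read the result back to the host
      return 0;
    }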

Much investigation would need to be done into optimal methods of FPGA programming.
E.g., would it be possible to "JIT" pipelines as set forth in fBLAS, i.e. generate
OpenCL kernels from the expression tree at evaluation time after optimization, compile
them to native code, and then stream data through?  (A rough sketch of this
kernel-generation idea follows below.)

Or, in the case of e.g. a NN, optimize iteratively and deliver a finished set of
generated code, or even a set of programmed FPGAs, derived from a Mahout algorithm
optimized on a distributed dataset?
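
To make the kernel-generation idea a bit more concrete, here is a hedged sketch of
emitting an OpenCL kernel source string at evaluation time and compiling it with the
C++ wrapper; emitFusedKernel and the fused y = alpha * x + y expression are purely
hypothetical stand-ins for walking Mahout's optimized expression tree:

    #define CL_HPP_TARGET_OPENCL_VERSION 200
    #include <CL/cl2.hpp>
    #include <string>

    // Hypothetical generator: a real one would walk the optimized expression tree;
    // here the whole "tree" is the single fused expression y = alpha * x + y.
    std::string emitFusedKernel(float alpha) {
      return "kernel void fused(global const float* x, global float* y) {"
             "  size_t i = get_global_id(0);"
             "  y[i] = " + std::to_string(alpha) + "f * x[i] + y[i];"
             "}";
    }

    int main() {
      cl::Context ctx(CL_DEVICE_TYPE_DEFAULT);
      // Compile the generated kernel at runtime for the attached device
      // (a GPU today, an FPGA toolchain such as Intel's AOCL [10][11] later).
      cl::Program prog(ctx, emitFusedKernel(2.0f), /*build=*/true);
      cl::Kernel fused(prog, "fused");
      // ... create buffers, set args, and stream partition blocks through `fused`
      return 0;
    }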

These are just thoughts on a way to move forward with the OpenCL module, which
currently sits in the experimental sub-project.

Work has been put forward in the Apache environment [15], and translations from
TensorFlow to ViennaCL, with the intent of extending to a generalized FPGA framework,
can be seen here [16].

Intel's oneAPI [20], while a reach, is something that could be explored.

AWS FPGA instances [21] are another option worth noting.


[1] http://viennacl.sourceforge.net/
[2] http://viennacl.sourceforge.net/doc/manual-installation.html#manual-installation-backends
[3] https://www.openmp.org/
[4] https://www.khronos.org/opencl/
[5] MAHOUT-XXXX
[6] MAHOUT-XXXX
[7] https://www.ibm.com/support/knowledgecenter/SSYKE2_8.0.0/com.ibm.java.vm.80.doc/docs/jni_sync.html
[8] Mahout ViennaCL Bindings#Memory Synchronization
[xx]
[9] https://www.khronos.org/opencl/resources
[10] https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl_getting_started.pdf
[11] https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf
[12] https://htor.inf.ethz.ch/publications/img/fblas.pdf
[13] https://www.khronos.org/opencl/
[14] https://github.com/KhronosGroup/OpenCL-CLHPP/blob/master/examples/src/trivial.cpp#L78
[17] https://www.intel.com/content/www/us/en/products/docs/storage/programmable/applications/data-analytics.html
[18] https://www.khronos.org/assets/uploads/developers/library/2018-khronos-group-opencl-embedded-outreach/Taipei-DSP-Profile-NTHU_Jan18.pdf
[19] https://www.intel.com/content/www/us/en/artificial-intelligence/programmable/fpga-gpu.html
[20] https://software.intel.com/content/www/us/en/develop/tools/oneapi.html
[21] https://aws.amazon.com/ec2/instance-types/f1/




