Hey all, I started a half-thought on Slack the other day and got pulled away before I could finish. This is just a quick writeup, which still needs more citations and empirical evidence to back up the reasoning, along with some examples.
I was looking at some FPGA work, and had a thought. We have two ViennaCL modules in the experimental section of the project, based on the ViennaCL abstract BLAS library [1][2]: mahout-viennacl-omp_2.x, which parallelizes routines across shared-memory CPU cores via OpenMP [3], and mahout-viennacl_2.x, which offloads routines to the GPU using OpenCL [4].

We've had good luck running the OpenMP module on off-heap main memory, showing significant speedups in Level 2 and Level 3 BLAS operations, in both sparse and dense matrix algebra. These provide speedups on algorithms like DSSVD and DSPCA [?], which are iterative and heavily dependent on matrix-matrix multiplication. The module still needs some clean-up, both of the module itself [5] and of the auto-probing logic [6]. There is also some work to be done on its memory management, but that work is trivial, i.e. ensuring that minimal memory copies are made during an evaluation trigger [7]. This module could be tuned, cleaned, and made ready for testing, to allow it back into the production environment.

The ViennaCL OpenCL module, however, is still very much in an experimental phase. Significant speedups have been shown in both sparse and dense matrix algebra [8], up to 30x on trivial matrix-matrix problems where each node's matrix blocks fit into that node's GPU memory. But no memory tiling or scatter/gather routines are implemented for multiple GPUs on a node [9] at the time of this writing, and memory management is naive, with a copy from off-heap to GPU memory at each operation. A memory management scheme that synchronizes main-memory operands and solutions has been proposed [8] but not implemented (a sketch of the idea follows below). Therefore the iterative evaluation of, e.g.,

    val drmB = ...
    var drmA = ...
    for (i <- 1 to 10) {
      val drmC = drmParallelize(Matrices.uniformView(drmB.ncol, 100, i))
      drmA = drmB %*% drmC
    }

would incur ten copies per block of both the drmA and drmB backing blocks, from and to the GPU respectively. Considering that a tiling strategy is not yet implemented for multiple GPUs either, there is significant work to do to get this module up to the state of the art for a Spark backend.

Furthermore, work on a CUDA backend has already begun in a branch, apache/mahout:[CUDA], which uses the JCuda library to interface directly with cuBLAS, cuSPARSE, and the other CUDA libraries; those have seen much work on memory management, shared memory, and optimized routines in the years since the ViennaCL library was written. It was decided in [xx] that further GPU work would target the NVIDIA libraries, and ViennaCL has been moved into the experimental sub-project. This leaves us with a partly finished ViennaCL library, which could still be useful for AMD GPUs; and since OpenCL itself is an abstraction layer, it leaves open the opportunity to begin experimental work on other hardware, e.g. FPGAs [10][11].

Proposal: finish out the basic GPU BLAS routines in the apache/mahout:[CUDA] branch, and repurpose the ViennaCL work for FPGA pipelines [10][11]. fBLAS demonstrates offloading matrix-math OpenCL kernels to FPGA backends [12]. And while fBLAS is a single example, the Khronos Group has enlisted a large group of hardware and software vendors to adopt the OpenCL standard [13]. For an example, see the Khronos Group's vector-add HPP OpenCL kernel [14]; a condensed version is sketched below. Much investigation would need to be done into optimal methods of FPGA programming. E.g., would it be possible to "JIT" pipelines as set forth in fBLAS, i.e. generate OpenCL kernels from the expression tree at evaluation time, after optimization, compile them to native code, and then stream data through? (A toy sketch of this also follows.)
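To make the proposed memory-management scheme concrete: as I read [8], the idea is to keep one resident device buffer per operand and copy only when one side is stale, instead of copying on every operation. Below is a minimal C++ sketch of that idea, assuming the OpenCL C++ bindings (cl2.hpp); the DeviceBuffer wrapper and its dirty flags are names invented here for illustration, not code that exists in the module.

    // Hypothetical sketch of the synchronization scheme proposed in [8]: keep one
    // resident device buffer per operand and copy only when a side is stale,
    // instead of copying on every operation. All names here are illustrative.
    #define CL_HPP_MINIMUM_OPENCL_VERSION 120
    #define CL_HPP_TARGET_OPENCL_VERSION 120
    #include <CL/cl2.hpp>
    #include <vector>

    struct DeviceBuffer {
      std::vector<double> host;   // main-memory (off-heap) image of the operand
      cl::Buffer device;          // resident GPU copy
      bool hostDirty = false;     // host was written since the last upload
      bool deviceDirty = false;   // device was written since the last download

      DeviceBuffer(const cl::Context& ctx, size_t n)
          : host(n), device(ctx, CL_MEM_READ_WRITE, n * sizeof(double)) {}

      // Upload only if the host copy is newer; the current module uploads always.
      void syncToDevice(cl::CommandQueue& q) {
        if (hostDirty) {
          q.enqueueWriteBuffer(device, CL_TRUE, 0,
                               host.size() * sizeof(double), host.data());
          hostDirty = false;
        }
      }

      // Download only when the host actually reads the operand or solution.
      void syncToHost(cl::CommandQueue& q) {
        if (deviceDirty) {
          q.enqueueReadBuffer(device, CL_TRUE, 0,
                              host.size() * sizeof(double), host.data());
          deviceDirty = false;
        }
      }
    };

Held in such a wrapper, drmB in the loop above would be uploaded once and reused across all ten iterations, rather than copied on every multiply.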
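For flavor on the OpenCL side, here is roughly what the vector-add referenced in [14] boils down to; this is a condensed sketch in the same CLHPP style, not the Khronos example verbatim.

    // Condensed vector-add in the style of the Khronos CLHPP example [14];
    // a sketch, not the upstream example verbatim.
    #define CL_HPP_MINIMUM_OPENCL_VERSION 120
    #define CL_HPP_TARGET_OPENCL_VERSION 120
    #include <CL/cl2.hpp>
    #include <iostream>
    #include <vector>

    int main() {
      const std::string src =
          "__kernel void vadd(__global const float* a,"
          "                   __global const float* b,"
          "                   __global float* c) {"
          "  size_t i = get_global_id(0);"
          "  c[i] = a[i] + b[i];"
          "}";

      cl::Context ctx(CL_DEVICE_TYPE_DEFAULT);
      cl::Program program(ctx, src, /*build=*/true);
      cl::CommandQueue queue(ctx);

      const size_t n = 1024;
      std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
      cl::Buffer dA(ctx, a.begin(), a.end(), /*readOnly=*/true);
      cl::Buffer dB(ctx, b.begin(), b.end(), /*readOnly=*/true);
      cl::Buffer dC(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float));

      // Bind and launch the kernel over n work-items, then read the result back.
      cl::KernelFunctor<cl::Buffer, cl::Buffer, cl::Buffer> vadd(program, "vadd");
      vadd(cl::EnqueueArgs(queue, cl::NDRange(n)), dA, dB, dC);
      cl::copy(queue, dC, c.begin(), c.end());

      std::cout << "c[0] = " << c[0] << std::endl;  // expect 3
      return 0;
    }

Nothing in the kernel source is GPU-specific, which is what leaves the FPGA door open.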
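And to make the "JIT" question concrete: one hypothetical shape for it is to walk the already-optimized expression tree at evaluation time, emit OpenCL source for the fused pipeline, and compile it on the fly. A toy sketch, again assuming cl2.hpp; the emitFusedKernel generator here is purely illustrative and is not how fBLAS actually generates kernels.

    // Toy sketch of evaluation-time ("JIT") kernel generation: emit OpenCL source
    // for a fused elementwise expression and compile it on the fly. The
    // emitFusedKernel generator is purely illustrative, not fBLAS's codegen.
    #define CL_HPP_MINIMUM_OPENCL_VERSION 120
    #define CL_HPP_TARGET_OPENCL_VERSION 120
    #include <CL/cl2.hpp>
    #include <iostream>
    #include <string>

    // Hypothetical emitter: in a real pipeline the kernel body would be derived
    // from the optimized expression tree at evaluation time.
    std::string emitFusedKernel(const std::string& expr) {
      return "__kernel void fused(__global const float* a,"
             "                    __global const float* b,"
             "                    __global const float* c,"
             "                    __global float* out) {"
             "  size_t i = get_global_id(0);"
             "  out[i] = " + expr + ";"
             "}";
    }

    int main() {
      cl::Context ctx(CL_DEVICE_TYPE_DEFAULT);
      // E.g. fuse (A + B) * C elementwise into a single generated kernel.
      cl::Program program(ctx, emitFusedKernel("(a[i] + b[i]) * c[i]"));
      if (program.build() != CL_SUCCESS) {  // the runtime compile step
        std::cerr << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(
                         ctx.getInfo<CL_CONTEXT_DEVICES>().front())
                  << std::endl;
        return 1;
      }
      // From here, stream operand buffers through the compiled "fused" kernel
      // exactly as in the vector-add sketch above.
      return 0;
    }

One caveat: in the Intel FPGA SDK the kernel compile is an offline place-and-route step [10][11], so "JIT" there would more likely mean selecting among pre-built bitstreams than compiling at evaluation time.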
Or, in the case of e.g. a neural network, optimize iteratively, and deliver a finished set of generated code, or even a set of programmed FPGAs, derived from a Mahout algorithm optimized on a distributed dataset? These are just thoughts on a way to move forward with the OpenCL module we currently have in the experimental sub-project. Work has been put forward in the Apache environment [15], and translations from TensorFlow to ViennaCL, with the intent of extending to a generalized FPGA framework, can be seen here [16]. Intel's oneAPI [20], while a reach, is something that could be explored, as are AWS F1 FPGA instances [21].

[1] http://viennacl.sourceforge.net/
[2] http://viennacl.sourceforge.net/doc/manual-installation.html#manual-installation-backends
[3] https://www.openmp.org/
[4] https://www.khronos.org/opencl/
[5] MAHOUT-XXXX
[6] MAHOUT-XXXX
[7] https://www.ibm.com/support/knowledgecenter/SSYKE2_8.0.0/com.ibm.java.vm.80.doc/docs/jni_sync.html
[8] Mahout ViennaCL Bindings#Memory Synchronization [xx]
[9] https://www.khronos.org/opencl/resources
[10] https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl_getting_started.pdf
[11] https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf
[12] https://htor.inf.ethz.ch/publications/img/fblas.pdf
[13] https://www.khronos.org/opencl/
[14] https://github.com/KhronosGroup/OpenCL-CLHPP/blob/master/examples/src/trivial.cpp#L78
[17] https://www.intel.com/content/www/us/en/products/docs/storage/programmable/applications/data-analytics.html
[18] https://www.khronos.org/assets/uploads/developers/library/2018-khronos-group-opencl-embedded-outreach/Taipei-DSP-Profile-NTHU_Jan18.pdf
[19] https://www.intel.com/content/www/us/en/artificial-intelligence/programmable/fpga-gpu.html
[20] https://software.intel.com/content/www/us/en/develop/tools/oneapi.html
[21] https://aws.amazon.com/ec2/instance-types/f1/