I'd be surprised if BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster, it's probably due to data layout and fewer levels of indirection. It's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra. I know that in the past Canny has gone as far as writing custom GPU kernels for performance-critical regions of code.[1]

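As a rough way to frame that kind of experiment, here is a minimal back-of-envelope sketch of how host-to-device copies can eat into a GPU GEMM speedup. It is my own illustration, not code from BIDMat or netlib-java: the function names are made up for this example, and the throughput and bandwidth figures are hypothetical placeholders rather than measurements.

```python
# Back-of-envelope cost model for dense matrix multiply on CPU vs GPU.
# All hardware figures used below are hypothetical placeholders.


def gemm_flops(m, k, n):
    """Floating-point operations for C (m x n) = A (m x k) * B (k x n)."""
    return 2.0 * m * k * n


def gemm_bytes(m, k, n, bytes_per_elem=8):
    """Bytes moved if A and B are copied to the device and C is copied back."""
    return bytes_per_elem * (m * k + k * n + m * n)


def cpu_time(m, k, n, cpu_flops_per_s):
    """Estimated seconds for the multiply on the CPU (compute only)."""
    return gemm_flops(m, k, n) / cpu_flops_per_s


def gpu_time(m, k, n, gpu_flops_per_s, bus_bytes_per_s, copy=True):
    """Estimated seconds on the GPU; copy=False assumes data is already resident."""
    compute = gemm_flops(m, k, n) / gpu_flops_per_s
    transfer = gemm_bytes(m, k, n) / bus_bytes_per_s if copy else 0.0
    return compute + transfer


if __name__ == "__main__":
    CPU_FLOPS = 50e9   # assumed sustained CPU throughput, FLOP/s
    GPU_FLOPS = 1e12   # assumed sustained GPU throughput, FLOP/s
    BUS_BYTES = 6e9    # assumed host<->device bandwidth, bytes/s
    for n in (100, 1000, 4000):
        t_cpu = cpu_time(n, n, n, CPU_FLOPS)
        t_copy = gpu_time(n, n, n, GPU_FLOPS, BUS_BYTES, copy=True)
        t_resident = gpu_time(n, n, n, GPU_FLOPS, BUS_BYTES, copy=False)
        print(f"n={n}: cpu={t_cpu:.6f}s gpu+copy={t_copy:.6f}s "
              f"gpu-resident={t_resident:.6f}s")
```

In this toy model the transfer term grows with n^2 while the compute term grows with n^3, so copies hurt most for small matrices and for pipelines that move intermediates back and forth on every call. It does not model kernel launch latency, JNI overhead, or GPU memory limits, so treat it as a sanity check only.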
BIDMach is highly optimized for single-node performance or performance on small clusters.[2] Once data doesn't fit easily in GPU memory (or can't be batched so that it does), performance tends to fall off. Canny argues for hardware/software codesign and as such prefers machine configurations that are quite different from what we find in most commodity cluster nodes, e.g. 10 disk channels and 4 GPUs.

In contrast, MLlib was designed for horizontal scalability on commodity clusters and works best on very big datasets, on the order of terabytes.

For the most part, these projects developed concurrently to address slightly different use cases. That said, there may be bits of BIDMach we could repurpose for MLlib. Keep in mind, though, that we need to be careful about maintaining cross-language compatibility for our Java and Python users.

- Evan

[1] - http://arxiv.org/abs/1409.5402
[2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

> Hi Evan,
>
> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know what makes them faster than netlib-java?
>
> The same group has the BIDMach library, which implements machine learning. For some examples they use the Caffe convolutional neural network library, owned by another group at Berkeley. Could you elaborate on how these all might be connected with Spark MLlib? If you take BIDMat for linear algebra, why don't you take BIDMach for optimization and learning?
>
> Best regards, Alexander
>
> *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
> *Sent:* Thursday, February 05, 2015 12:09 PM
> *To:* Ulanov, Alexander
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Using CUDA within Spark / boosting linear algebra
>
> I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in many cases.
>
> You might consider taking a look at the code paths that BIDMat (https://github.com/BIDData/BIDMat) takes and comparing them to netlib-java/breeze. John Canny et al. have done a bunch of work optimizing to make this work really fast from Scala. I've run it on my laptop and compared it to MKL, and in certain cases it's 10x faster at matrix multiply. There are a lot of layers of indirection here and you really want to avoid data copying as much as possible.
>
> We could also consider swapping out BIDMat for Breeze, but that would be a big project, and if we can figure out how to get breeze+cublas to comparable performance that would be a big win.
>
> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>
> Dear Spark developers,
>
> I am exploring how to make linear algebra operations faster within Spark. One way of doing this is to use the Scala Breeze library that is bundled with Spark. For matrix operations, it employs netlib-java, which provides a Java wrapper for BLAS (Basic Linear Algebra Subprograms) and LAPACK native binaries if they are available on the worker node. It also has its own optimized Java implementation of BLAS. It is worth mentioning that native binaries provide better performance only for BLAS level 3, i.e. matrix-matrix operations or general matrix multiplication (GEMM). This is confirmed by the GEMM test on the netlib-java page (https://github.com/fommil/netlib-java). I also confirmed it with my experiments with training an artificial neural network (https://github.com/apache/spark/pull/1290#issuecomment-70313952). However, I would like to boost performance more.
>
> GPUs are supposed to work fast with linear algebra, and there is an Nvidia CUDA implementation of BLAS called cublas. I have one Linux server with an Nvidia GPU and I was able to do the following. I linked cublas (instead of CPU-based BLAS) with the netlib-java wrapper and put it into Spark, so Breeze/Netlib is using it.
> Then I did some performance measurements of artificial neural network batch learning in Spark MLlib, which involves matrix-matrix multiplications. It turns out that for matrices smaller than ~1000x780, GPU cublas has the same speed as CPU BLAS. Cublas becomes slower for bigger matrices. It is worth mentioning that this was not a test of multiplication ONLY, since there are other operations involved. One of the reasons for the slowdown might be the overhead of copying the matrices from computer memory to graphics card memory and back.
>
> So, a few questions:
> 1) Do these results with CUDA make sense?
> 2) If the problem is the copy overhead, are there any libraries that allow forcing intermediate results to stay in graphics card memory, thus removing the overhead?
> 3) Any other options to speed up linear algebra in Spark?
>
> Thank you, Alexander