Hi Evan,

Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know what 
makes it faster than netlib-java?

The same group has a BIDMach library that implements machine learning. For some 
examples they use the Caffe convolutional neural network library, developed by 
another group at Berkeley. Could you elaborate on how all of these might be 
connected with Spark MLlib? If you take BIDMat for linear algebra, why not take 
BIDMach for optimization and learning?

Best regards, Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Thursday, February 05, 2015 12:09 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in many 
cases.

You might consider taking a look at the codepaths that BIDMat 
(https://github.com/BIDData/BIDMat) takes and comparing them to 
netlib-java/breeze. John Canny et al. have done a bunch of optimization work to 
make this run really fast from Scala. I've run it on my laptop and compared it to 
MKL, and in certain cases it's 10x faster at matrix multiply. There are a lot of 
layers of indirection here, and you really want to avoid data copying as much as 
possible.
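
For a quick feel of the BIDMat codepath, here's a rough sketch following the 
conventions in its README. The exact names (FMat for a single-precision CPU 
matrix, GMat for its GPU counterpart, rand from BIDMat's function objects) are 
assumptions from the docs, not verified code:

    import BIDMat.{FMat, GMat}
    import BIDMat.SciFunctions._

    object BidmatGemm {
      def main(args: Array[String]): Unit = {
        val n = 4096
        val a: FMat = rand(n, n)  // single-precision matrix on the CPU
        val cCpu = a * a          // CPU GEMM (MKL if present)
        val g = GMat(a)           // one explicit copy into GPU memory
        val cGpu = g * g          // GEMM runs on the GPU; result stays on the device
        println(FMat(cGpu)(0, 0)) // copy back only when the value is needed
      }
    }

The point of the FMat/GMat split is exactly the copy-avoidance above: the 
conversion to GPU memory is explicit and happens once, and everything after it 
stays on the device.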

We could also consider swapping Breeze out for BIDMat, but that would be a big 
project, and if we can figure out how to get Breeze+cuBLAS to comparable 
performance, that would be a big win.

On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander 
<alexander.ula...@hp.com> wrote:
Dear Spark developers,

I am exploring how to make linear algebra operations faster within Spark. One 
way of doing this is to use the Scala Breeze library that is bundled with Spark. 
For matrix operations, it employs netlib-java, which provides a Java wrapper for 
native BLAS (Basic Linear Algebra Subprograms) and LAPACK binaries if they are 
available on the worker node, and falls back to its own optimized Java 
implementation of BLAS otherwise. It is worth mentioning that native binaries 
provide better performance only for BLAS level 3, i.e. matrix-matrix operations 
such as general matrix multiplication (GEMM). This is confirmed by the GEMM 
benchmark on the netlib-java page https://github.com/fommil/netlib-java. I also 
confirmed it in my experiments with training an artificial neural network 
https://github.com/apache/spark/pull/1290#issuecomment-70313952. However, I 
would like to boost performance further.
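
For concreteness, here is a minimal sketch of the kind of multiplication in 
question, written with Breeze (the matrix size is an arbitrary placeholder). 
Breeze dispatches the product to netlib-java, which picks a native BLAS if one 
can be loaded and its Java implementation otherwise:

    import breeze.linalg.DenseMatrix

    object BreezeGemm {
      def main(args: Array[String]): Unit = {
        val n = 1000
        // Breeze stores matrices column-major, which is what BLAS expects.
        val a = DenseMatrix.rand(n, n)
        val b = DenseMatrix.rand(n, n)
        val start = System.nanoTime()
        val c = a * b // dispatched to netlib-java DGEMM
        println(s"GEMM ${n}x${n}: ${(System.nanoTime() - start) / 1e6} ms, c(0,0)=${c(0, 0)}")
      }
    }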

GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA 
implementation of BLAS called cuBLAS. I have one Linux server with an Nvidia 
GPU, and I was able to do the following: I linked cuBLAS (instead of a CPU-based 
BLAS) to the netlib-java wrapper and put it into Spark, so Breeze/netlib uses 
it. Then I did some performance measurements of artificial neural network batch 
learning in Spark MLlib, which involves matrix-matrix multiplications. It turns 
out that for matrices smaller than roughly 1000x780, GPU cuBLAS has the same 
speed as CPU BLAS, and cuBLAS becomes slower for bigger matrices. It is worth 
mentioning that this was not a test of ONLY multiplication, since other 
operations are involved. One possible reason for the slowdown is the overhead of 
copying the matrices from main memory to graphics card memory and back.
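
To separate the raw BLAS call from the rest of the training loop, one can time 
DGEMM directly through netlib-java, the same entry point Breeze ends up calling. 
A minimal sketch (the sizes are placeholders; which backend is picked depends on 
what is loadable on the machine):

    import com.github.fommil.netlib.BLAS

    object DgemmTiming {
      def main(args: Array[String]): Unit = {
        val blas = BLAS.getInstance() // native backend if available, Java fallback otherwise
        val (m, n, k) = (1000, 780, 1000)
        val a = Array.fill(m * k)(scala.util.Random.nextDouble())
        val b = Array.fill(k * n)(scala.util.Random.nextDouble())
        val c = new Array[Double](m * n)
        val start = System.nanoTime()
        // C := 1.0 * A * B + 0.0 * C, all arrays column-major
        blas.dgemm("N", "N", m, n, k, 1.0, a, m, b, k, 0.0, c, m)
        println(s"DGEMM ${m}x${k} by ${k}x${n}: ${(System.nanoTime() - start) / 1e6} ms")
      }
    }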

So, a few questions:
1) Do these results with CUDA make sense?
2) If the problem is copy overhead, are there any libraries that make it 
possible to keep intermediate results in graphics card memory, thus removing the 
overhead? (Something along the lines of the sketch below is what I have in mind.)
3) Are there any other options to speed up linear algebra in Spark?
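
To illustrate what I mean in question 2: with a CUDA binding such as JCublas (a 
third-party library, not something Spark bundles), one can upload the inputs 
once, chain two GEMMs with the intermediate staying on the device, and copy back 
only the final result. This is an untested sketch against JCublas's documented 
API, not a recommendation:

    import jcuda.{Pointer, Sizeof}
    import jcuda.jcublas.JCublas

    object DeviceResidentGemm {
      def main(args: Array[String]): Unit = {
        val n = 1024
        val h = Array.fill(n * n)(scala.util.Random.nextFloat())
        JCublas.cublasInit()
        // Allocate device buffers and upload the input once.
        val dA, dB, dC = new Pointer()
        JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dA)
        JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dB)
        JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dC)
        JCublas.cublasSetVector(n * n, Sizeof.FLOAT, Pointer.to(h), 1, dA, 1)
        // B := A * A, then C := B * A; the intermediate B never leaves the GPU.
        JCublas.cublasSgemm('n', 'n', n, n, n, 1f, dA, n, dA, n, 0f, dB, n)
        JCublas.cublasSgemm('n', 'n', n, n, n, 1f, dB, n, dA, n, 0f, dC, n)
        // Copy back only the final result.
        val out = new Array[Float](n * n)
        JCublas.cublasGetVector(n * n, Sizeof.FLOAT, dC, 1, Pointer.to(out), 1)
        JCublas.cublasFree(dA); JCublas.cublasFree(dB); JCublas.cublasFree(dC)
        JCublas.cublasShutdown()
        println(s"out(0) = ${out(0)}")
      }
    }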

Thank you, Alexander
