Hi Evan,

Thank you for the explanation and the useful link. I am going to build OpenBLAS, link 
it with Netlib-java, and run the benchmark again.

Do I understand correctly that the BIDMat binaries contain a statically linked Intel 
MKL BLAS? That might be why I am able to run BIDMat without having MKL BLAS 
installed on my server. If that is true, I wonder whether it is OK from a licensing 
standpoint, since Intel sells this library. In any case, it seems that in my setup the 
precompiled MKL BLAS performs better than the precompiled OpenBLAS, given that BIDMat 
and Netlib-java are supposed to have comparable JNI overheads.

Still, it might be interesting to link Netlib-java with Intel MKL, as you 
suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (Netlib-java) 
would be interested in comparing their libraries.

Best regards, Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Friday, February 06, 2015 5:58 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I would build OpenBLAS yourself, since good BLAS performance comes from getting 
cache sizes, etc. set up correctly for your particular hardware - this is often 
a very tricky process (see, e.g. ATLAS), but we found that on relatively modern 
Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL.

To make sure the right library is getting used, you have to make sure it's 
first on the search path - export LD_LIBRARY_PATH=/path/to/blas/lib (the directory 
containing the built .so, not the file itself) will do the trick here.

For some examples of getting netlib-java set up on an EC2 node and some example 
benchmarking code we ran a while back, see: 
https://github.com/shivaram/matrix-bench

In particular - build-openblas-ec2.sh shows you how to build the library and 
set up symlinks correctly, and scala/run-netlib.sh shows you how to get the 
path setup and get that library picked up by netlib-java.
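
As a quick sanity check, the following little program (just a sketch, not part of 
that repo) prints which implementation netlib-java actually loaded - you should see 
NativeSystemBLAS rather than F2jBLAS once the native library is picked up:

// A quick check (sketch, not from the matrix-bench repo) of which BLAS backend
// netlib-java loaded. NativeSystemBLAS means a system library such as OpenBLAS
// or MKL was found; F2jBLAS means the pure-Java fallback is in use.
import com.github.fommil.netlib.BLAS

object WhichBlas {
  def main(args: Array[String]): Unit =
    println("Loaded BLAS: " + BLAS.getInstance().getClass.getName)
}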

In the same way, you could probably get cuBLAS set up to be used by netlib-java as 
well.

- Evan

On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander 
<alexander.ula...@hp.com> wrote:
Evan, could you elaborate on how to force BIDMat and netlib-java to load the right 
BLAS? For netlib, there are a few JVM flags, such as 
-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, with which I can 
force it to use the Java implementation. I am not sure I understand how to force the 
use of a specific BLAS (as opposed to a specific wrapper for BLAS).

By the way, I have installed OpenBLAS (yum install openblas), so I suppose that netlib 
is using it.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Friday, February 06, 2015 5:19 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org

Subject: Re: Using CUDA within Spark / boosting linear algebra

Getting Breeze to pick up the right BLAS library is critical for performance. I 
recommend using OpenBLAS (or MKL, if you already have it). It might make sense 
to force BIDMat to use the same underlying BLAS library as well.

On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander 
<alexander.ula...@hp.com> wrote:
Hi Evan, Joseph

I did a few matrix multiplication tests, and BIDMat seems to be ~10x faster than 
netlib-java+breeze:

| A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
+-------------------------+-------------+-----------------------------------------------+----------------------------+
| 100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
| 1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
| 10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, 
Scala 2.11.
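
For the Breeze+Netlib-java columns, the timing loop is essentially the following (a 
minimal sketch of the kind of code I use, not the exact benchmark); the multiply goes 
through Breeze and therefore through whatever BLAS netlib-java has loaded:

// A minimal sketch of a GEMM timing loop with Breeze (double precision).
// Breeze delegates the multiply to netlib-java, so the result reflects
// whichever BLAS backend (native or f2j) is on the classpath and library path.
import breeze.linalg.DenseMatrix

object GemmBench {
  def time(n: Int, trials: Int = 5): Double = {
    val a = DenseMatrix.rand[Double](n, n)
    val b = DenseMatrix.rand[Double](n, n)
    a * b                                        // warm-up, forces the BLAS to load
    val start = System.nanoTime()
    var i = 0
    while (i < trials) { a * b; i += 1 }
    (System.nanoTime() - start) / 1e9 / trials   // average seconds per multiply
  }

  def main(args: Array[String]): Unit =
    for (n <- Seq(100, 1000)) println(s"${n}x$n * ${n}x$n: ${time(n)} s")
}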

Later I will run tests with CUDA; I need to install a new CUDA version for this 
purpose.

Do you have any idea why Breeze+netlib with native BLAS is so much slower than 
BIDMat with MKL?

Best regards, Alexander

From: Joseph Bradley [mailto:jos...@databricks.com]
Sent: Thursday, February 05, 2015 5:29 PM
To: Ulanov, Alexander
Cc: Evan R. Sparks; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alexander,

Using GPUs with Spark would be very exciting.  Small comment: Concerning your 
question earlier about keeping data stored on the GPU rather than having to 
move it between main memory and GPU memory on each iteration, I would guess 
this would be critical to getting good performance.  If you could do multiple 
local iterations before aggregating results, then the cost of data movement to 
the GPU could be amortized (and I believe that is done in practice).  Having 
Spark be aware of the GPU and using it as another part of memory sounds like a 
much bigger undertaking.

Joseph

On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander 
<alexander.ula...@hp.com> wrote:
Thank you for the explanation! I’ve watched the BIDMach presentation by John Canny, 
and I am really inspired by his talk and the comparisons with Spark MLlib.

I am very interested in finding out which will be better within Spark: BIDMat or 
netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark 
them? Currently I run benchmarks on artificial neural networks in batch mode. 
While this is not a “pure” test of linear algebra, it involves other operations 
that are essential to machine learning.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Thursday, February 05, 2015 1:29 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd be surprised if BIDMat+OpenBLAS were significantly faster than 
netlib-java+OpenBLAS, but if it is much faster, that is probably due to data layout 
and fewer levels of indirection - it's definitely a worthwhile experiment to 
run. The main speedups I've seen from using it come from its highly optimized GPU 
code for linear algebra. I know that in the past Canny has gone as far as writing 
custom GPU kernels for performance-critical regions of code.[1]

BIDMach is highly optimized for single-node performance or performance on small 
clusters.[2] Once data doesn't fit easily in GPU memory (or can't be batched that 
way), the performance tends to fall off. Canny argues for hardware/software 
codesign and as such prefers machine configurations that are quite different 
from what we find in most commodity cluster nodes - e.g. 10 disk channels and 4 
GPUs.

In contrast, MLlib was designed for horizontal scalability on commodity 
clusters and works best on very big datasets - on the order of terabytes.

For the most part, these projects developed concurrently to address slightly 
different use cases. That said, there may be bits of BIDMach we could repurpose 
for MLlib - keep in mind, though, that we need to be careful about maintaining 
cross-language compatibility for our Java and Python users.

- Evan

[1] - http://arxiv.org/abs/1409.5402
[2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander 
<alexander.ula...@hp.com> wrote:
Hi Evan,

Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know what 
makes it faster than netlib-java?

The same group has the BIDMach library that implements machine learning. For some 
examples they use the Caffe convolutional neural network library, developed by 
another group at Berkeley. Could you elaborate on how all of these might be connected 
with Spark MLlib? If you take BIDMat for linear algebra, why don’t you take BIDMach 
for optimization and learning?

Best regards, Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Thursday, February 05, 2015 12:09 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in many 
cases.

You might consider taking a look at the codepaths that BIDMat 
(https://github.com/BIDData/BIDMat) takes and comparing them to 
netlib-java/breeze. John Canny et al. have done a bunch of optimization work to 
make this really fast from Scala. I've run it on my laptop, compared it to 
MKL, and in certain cases it's 10x faster at matrix multiply. There are a lot of 
layers of indirection here, and you really want to avoid data copying as much as 
possible.
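
If you want a quick, rough comparison on your own box, something like the sketch 
below should do it; this is from memory, so the import locations and the rand helper 
are assumptions to check against the BIDMat sources:

// A rough sketch from memory (verify names against the BIDMat docs): timing a
// single-precision matrix multiply in BIDMat. rand(n, n) is assumed to come from
// BIDMat's SciFunctions and to return an FMat of uniform randoms; the * operator
// dispatches to BIDMat's native BLAS (MKL in the prebuilt binaries).
import BIDMat.FMat
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._

object BidmatMultiply {
  def main(args: Array[String]): Unit = {
    val n = 2000
    val a = rand(n, n)
    val b = rand(n, n)
    val start = System.nanoTime()
    val c = a * b
    println(s"${n}x$n multiply took ${(System.nanoTime() - start) / 1e9} s")
  }
}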

We could also consider swapping Breeze out for BIDMat, but that would be a big 
project, and if we can figure out how to get breeze+cublas to comparable 
performance, that would be a big win.

On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander 
<alexander.ula...@hp.com> wrote:
Dear Spark developers,

I am exploring how to make linear algebra operations faster within Spark. One 
way of doing this is to use the Scala Breeze library that is bundled with Spark. 
For matrix operations, it employs Netlib-java, which has a Java wrapper for BLAS 
(basic linear algebra subprograms) and LAPACK native binaries if they are 
available on the worker node. It also has its own optimized Java implementation 
of BLAS. It is worth mentioning that native binaries provide better 
performance only for BLAS level 3, i.e. matrix-matrix operations or general 
matrix multiplication (GEMM). This is confirmed by the GEMM test on the Netlib-java 
page https://github.com/fommil/netlib-java. I also confirmed it with my 
experiments with training of an artificial neural network 
https://github.com/apache/spark/pull/1290#issuecomment-70313952. However, I 
would like to boost performance more.
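
To illustrate where the native call happens, here is a minimal sketch (for 
illustration only, not code from my experiments) of invoking GEMM directly through 
Netlib-java, bypassing Breeze; matrices are flat column-major arrays, as BLAS expects:

// Minimal sketch: calling DGEMM (C = alpha*A*B + beta*C) directly through
// netlib-java. BLAS.getInstance() returns the native backend if one was found on
// the library path, otherwise the pure-Java F2jBLAS fallback.
import com.github.fommil.netlib.BLAS

object DirectGemm {
  def main(args: Array[String]): Unit = {
    val n = 1000
    val rnd = new scala.util.Random(42)
    val a = Array.fill(n * n)(rnd.nextDouble())   // column-major n x n
    val b = Array.fill(n * n)(rnd.nextDouble())
    val c = new Array[Double](n * n)
    val blas = BLAS.getInstance()
    blas.dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
    println("backend: " + blas.getClass.getName + ", c(0) = " + c(0))
  }
}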

GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA 
implementation of BLAS called cuBLAS. I have one Linux server with an Nvidia GPU, 
and I was able to do the following. I linked cuBLAS (instead of a CPU-based BLAS) 
with the Netlib-java wrapper and put it into Spark, so Breeze/Netlib is using it. 
Then I did some performance measurements with regard to artificial neural 
network batch learning in Spark MLlib, which involves matrix-matrix 
multiplications. It turns out that for matrices of size less than ~1000x780, GPU 
cuBLAS has the same speed as CPU BLAS, and cuBLAS becomes slower for bigger 
matrices. It is worth mentioning that this was not a test of ONLY multiplication, 
since other operations are involved. One of the reasons for the slowdown 
might be the overhead of copying the matrices from main memory to graphics 
card memory and back.

So, a few questions:
1) Do these results with CUDA make sense?
2) If the problem is the copy overhead, are there any libraries that allow forcing 
intermediate results to stay in graphics card memory, thus removing the overhead?
3) Any other options to speed up linear algebra in Spark?

Thank you, Alexander
